Introduction

  • Identify and formulate a regression or a classification problem of interest to you that calls for a data-driven answer. You can start with your domain-specific interests or by looking for an interesting dataset.
  • Aim for at least 5 input variables.
  • Explain your variables briefly

Project description: This project’s purpose is to predict bike traffic by means of weather data. To this end, we use freely available data on bike traffic collected at 5 bike counting stations in the Rhein-Neuss district in North Rhine-Westphalia, Germany, and merge it with equally freely available data from the Deutscher Wetterdienst (DWD) for Düsseldorf, which is the weather station geographically closest to the area where the bike traffic was counted (just on the opposite side of the river Rhine).

We only use data for the dates on which all five counting stations reported some traffic and aggregate their counts, so that we do not make separate predictions per station but rather predict the ‘general traffic’ at the corresponding point in time.

Most explanatory variables are available at an hourly aggregation level, which is the level at which our regression takes place. Some are only available on a daily basis – for simplicity, we assume these to be constant throughout the day. In addition, we use dummies implied by the date (Hour, Month, Weekday) and dummies for national holidays and school holidays.

We recommend consulting our variable list: data/raw/VariableList.xlsx. It provides information on the DWD abbreviations, our assigned variable names, the formats before and after final processing, the measurement units and, where necessary, an explanation.

The models are trained using cross-validation. At the end, each model’s CV error is compared to its training error; the goal is to keep these close together, indicating that the model is not overfitting. The metric for this is the MSE, since it is the default for some packages (e.g. caret reports it for the Out-Of-Bag error).
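The principle of comparing cross-validation error with training error can be sketched in a few lines of base R (a toy illustration with simulated data, not our actual modelling code):

# Toy illustration (simulated data): compare k-fold CV MSE with training MSE
set.seed(1)
n  <- 200
df <- data.frame(x = runif(n))
df$y <- 2 * df$x + rnorm(n, sd = 0.1)

k     <- 5
folds <- sample(rep(1:k, length.out = n))     # random fold assignment

cv_mse <- sapply(1:k, function(f) {
  fit  <- lm(y ~ x, data = df[folds != f, ]) # train on k-1 folds
  pred <- predict(fit, newdata = df[folds == f, ])
  mean((df$y[folds == f] - pred)^2)          # MSE on the held-out fold
})

train_mse <- mean(resid(lm(y ~ x, data = df))^2)
c(cv = mean(cv_mse), train = train_mse)       # should be close together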

We have so much data that we can also afford an 80-20 split, which gives us performance metrics on entirely new data. These metrics – MAE, RMSE and R² – are all summarised at the end.
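The hold-out metrics are computed with the standard formulas; a minimal sketch on made-up vectors:

# Minimal sketch of the hold-out metrics on made-up vectors
actual <- c(10, 20, 30, 40)
pred   <- c(12, 18, 33, 37)

mae  <- mean(abs(actual - pred))
rmse <- sqrt(mean((actual - pred)^2))
r2   <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)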

The data sources can be inspected at: <project directory>/data/raw/sources.ods. The script contains web scraping, meaning that some information is retrieved freshly every time you run the code.

Description

  • Collect your dataset(s), explore your data for deficiencies such as missing data and formatting problems and prepare it for modelling.
  • Extensive data collection and preparation yields extra credit but is not mandatory for this coursework.
  • Explore the data via descriptive statistics and visualization.

Data Processing

NOTE: If you do not want to reproduce the whole data preparation process with all its steps, ensure that the variable process_raw_data is set to FALSE; you can change this in the setup. Then, comment out the if(){} conditions in the data preparation chunks around lines 115 and 807 (technical note: these had to be commented out so that the output for the HTML rendering appears throughout the code and not only at the end of the if block). In that case, only the preprocessed final data – available in data/final/ – is loaded, reducing the running time by approx. 5 min.

Ensure process_raw_data exists:

if(!exists('process_raw_data')){
  
  process_raw_data <- TRUE
  load_final_data <- !process_raw_data
  
}
# If data preparation not desired, just load the already saved data.

if(load_final_data) {
data_final_aggr <- readRDS(paste(wd, "data/final/data_final_bike_weather.RDS", sep = "/"))
}

Extensive data preparation:

#  if (process_raw_data) {
# ==============================================================================
# Bike traffic data
# ==============================================================================

# Unzip the zip files
unzip(paste(wd, "data/raw/eco-counter-data.zip", sep = "/"), exdir = "bikedata_dir")
bike_data <- read.csv("bikedata_dir/eco-counter-data.csv", sep = ";")
head(bike_data$Datum)
## [1] "2016-04-04T23:00:00+00:00" "2016-04-05T07:00:00+00:00"
## [3] "2016-04-05T09:00:00+00:00" "2016-04-05T14:00:00+00:00"
## [5] "2016-04-05T17:00:00+00:00" "2016-04-05T18:00:00+00:00"
bike_data$Date <- as.Date(substr(bike_data$Datum, 1, 10))
bike_data$Hour <- as.double(substr(bike_data$Datum, 12, 13))

# Junk information: No variation, useless
unique(substr(bike_data$Datum, 11, 11))
## [1] "T"
unique(substr(bike_data$Datum, 14, 26))
## [1] ":00:00+00:00"
#bike_data$Datum <- NULL
# Machine-readable date
bike_data$Date_Hour <- as.POSIXct(paste(bike_data$Date, " ", bike_data$Hour, ":00", sep = ""), format = "%Y-%m-%d %H:%M")
bike_data$Weekday <- weekdays(bike_data$Date, abbreviate = TRUE)
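# Side note (a sketch, not used above): the ISO timestamps in 'Datum' could
# also be parsed in one step via a format string instead of substr() slicing:
iso_example   <- "2016-04-04T23:00:00+00:00"
parsed_direct <- as.POSIXct(iso_example, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")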



# Data is still in 10-min frequency. Aggregate on an hourly basis.
hourly_info <- bike_data[!duplicated(paste(bike_data$Date_Hour, bike_data$Zählstelle)), - which(colnames(bike_data) == "Anzahl")]
hourly_number <- aggregate(data = bike_data, Anzahl ~ Date_Hour + Zählstelle, sum)

bike_data_aggr <- merge(hourly_info, hourly_number,
                        by = c("Date_Hour", "Zählstelle"))
 
aggregate(data = bike_data_aggr, Date_Hour ~ Zählstelle, min)
##     Zählstelle           Date_Hour
## 1     Dormagen 2015-11-16 23:00:00
## 2 Grevenbroich 2015-12-07 23:00:00
## 3       Jüchen 2015-12-03 23:00:00
## 4    Meerbusch 2016-01-25 23:00:00
## 5        Neuss 2015-12-10 23:00:00
aggregate(data = bike_data_aggr, Date_Hour ~ Zählstelle, max)
##     Zählstelle           Date_Hour
## 1     Dormagen 2025-04-28 21:00:00
## 2 Grevenbroich 2025-04-03 21:00:00
## 3       Jüchen 2024-11-14 22:00:00
## 4    Meerbusch 2025-04-28 21:00:00
## 5        Neuss 2025-04-28 21:00:00
table(bike_data_aggr$Zählstelle)
## 
##     Dormagen Grevenbroich       Jüchen    Meerbusch        Neuss 
##        76799        80171        74780        76174        70496
nrow(bike_data) / nrow(bike_data_aggr) # with full data it should be 6.
## [1] 4.748338
# Differing data availability periods are the reason for incomplete data over the whole time span
sum(is.na(bike_data$Anzahl))
## [1] 129423
sum(is.na(bike_data_aggr$Anzahl)) # we got rid of all NAs during merging.
## [1] 0
# Plausibility check:
# Daily aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ Hour, sum)) # Aggregation is probably
# correct, since night hours show fewer bikers

# Weekday aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ as.factor(Weekday), sum),
     main = 'Weekdaily aggregated bike traffic')

# Monthly aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ format(substr(Date, 6,7), format = "%m"), sum),
     main = 'Monthly aggregated bike traffic')

# Days of the year aggregation
plot(aggregate(data = bike_data_aggr, Anzahl ~ as.Date(substr(Date, 6,10), format = "%m-%d"), sum),
     main = 'By calendar day: aggregated bike traffic')

# SIMPLIFICATIONS
colnames(bike_data_aggr)
##  [1] "Date_Hour"   "Zählstelle"  "Id"          "Datum"       "Status"     
##  [6] "Channel.Id"  "Koordinaten" "Date"        "Hour"        "Weekday"    
## [11] "Anzahl"
unique(paste(bike_data_aggr$Zählstelle, bike_data_aggr$Id))
## [1] "Dormagen 100019716"     "Jüchen 100019719"       "Grevenbroich 100019718"
## [4] "Neuss 100019717"        "Meerbusch 100019715"
bike_data_aggr$Id <- NULL # ID not needed
bike_data_aggr$Koordinaten <- NULL # not needed
bike_data_aggr$Datum <- NULL # date has been transformed to useful
unique(bike_data_aggr$Status)
## [1] "raw"      "modified" ""
bike_data_aggr$Status <- NULL # we will not examine quality of measurement
unique(bike_data_aggr$Channel.Id) # not useful for our purpose
##  [1] 104019716 102019716 103019716 101019716 105019716 106019716 104019719
##  [8] 102019719 103019719 101019719 104019718 101019718 103019718 102019718
## [15] 103019717 104019717 101019717 102019717 103019715 102019715 104019715
## [22] 101019715
bike_data_aggr$Channel.Id <- NULL




# ==============================================================================
# Meteorological hourly data - for Düsseldorf (very close to the places that are relevant for us)
# ==============================================================================

# What is interesting: 
# Added manually after inspection of the Metadaten_Parameter_<...>_01078.html
# files which are to be found in the zip files in the data/raw directory,
# or -- after running the code -- in the unpacked zip folders.

var_of_int <- c(Date_Hour = "Date_Hour",
                V_N = "cloud_cover",
                V_TE002 = "tmp_soil",
                DD = "wind_degree",
                FF = "wind_speed",
                FX_911 = "wind_highest_last_h",
                TT = "air_temp",
                R1 = "precip_h_mm",
                RS_ind = "precip_h_indicator",
                WRTR = "precip_type",
                SD_SO = "sunshine_duration",
                RF_STD = "rel_humidity",
                P_STD = "air_pressure",
                V_VV = "visibility",
                WW = "observed_weather")

# List of files
filelist_meteo <- list.files(path = paste(wd, "data/raw", sep = "/"))
# Read in the hourly data, but NOT the verbal descriptions ('WW' files), which are of no interest
filelist_meteo <- filelist_meteo[intersect(which(grepl("stundenwerte", filelist_meteo)),
                                           which(!grepl("WW", filelist_meteo)))]

# Loop for unzipping data
for(i in 1:length(filelist_meteo)) {
  
  tmp_dir <- paste("tmp/", substr(filelist_meteo[i], 1, 15), sep = "")
  
  unzip_dir <- unzip(paste(wd, "data/raw", filelist_meteo[i], sep = "/"), 
                     exdir = tmp_dir)
  
  list_files_zip <- list.files(tmp_dir)
  relevant_csv <- list_files_zip[which(grepl("produkt", list_files_zip))]
  csv <- read.csv(paste(tmp_dir, relevant_csv, sep = "/"), sep = ";")
  
  # Identify NAs, encoded as -999
  csv[csv == -999] <- NA
  
  # Construct uniform date
  csv$Date_Hour <- as.POSIXct(paste(substr(csv$MESS_DATUM, 1, 4), "-",
                                    substr(csv$MESS_DATUM, 5, 6), "-",
                                    substr(csv$MESS_DATUM, 7, 8), " ",
                                    substr(csv$MESS_DATUM, 9, 10), ":00",
                                    sep = ""),
                              format = "%Y-%m-%d %H:%M")
  
  # Only needed for certain period
  csv <- csv[csv$Date_Hour >= "2015-11-16 23:00:00",]
  csv <- csv[rowSums(is.na(csv)) < ncol(csv),]
  
  # Save only what is relevant for final data, and check no double saving happens.
  keep_cols <- if(!"csv_pass" %in% ls()){
      which(colnames(csv) %in% names(var_of_int))
    } else {
      intersect(which(colnames(csv) %in% names(var_of_int)),
                which(!var_of_int[colnames(csv)] %in% colnames(csv_pass)[-which(colnames(csv_pass) == "Date_Hour")]))
    } 
  
  csv_final <- csv[, keep_cols]
  
  # When no relevant var_of_interest in the table:
  # No further processing
  if(is.null(ncol(csv_final))) {next}
  
  
  colnames(csv_final) <- var_of_int[colnames(csv_final)]
  
  # If it does not exist -> initiate the data frame
  # If it exists -> merge the new columns with the existing ones
  if(i == 1) {
    csv_pass <- csv_final
  } else {
    csv_pass <- merge(csv_final, csv_pass, all.x = TRUE, all.y = TRUE, by = "Date_Hour")
  }

  # Bring it together with the hourly bike data 
  weather_data_final <- merge(csv_pass, bike_data_aggr, by = "Date_Hour")
  


}
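# The column-wise accumulation in the loop above boils down to a full outer
# merge on Date_Hour. A sketch of the pattern on hypothetical mini-frames:
toy_a <- data.frame(Date_Hour = 1:3, air_temp   = c(5, 6, 7))
toy_b <- data.frame(Date_Hour = 2:4, wind_speed = c(3, 4, 5))
toy_merged <- merge(toy_a, toy_b, all.x = TRUE, all.y = TRUE, by = "Date_Hour")
# -> 4 rows; NA where one source has no observation for that hour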


length(which(weather_data_final$wind_degree == 360))
## [1] 7302
colnames(weather_data_final)
##  [1] "Date_Hour"           "visibility"          "air_pressure"       
##  [4] "rel_humidity"        "air_temp"            "sunshine_duration"  
##  [7] "precip_h_mm"         "precip_type"         "wind_highest_last_h"
## [10] "wind_speed"          "wind_degree"         "tmp_soil"           
## [13] "cloud_cover"         "Zählstelle"          "Date"               
## [16] "Hour"                "Weekday"             "Anzahl"
# Quick test whether the data can already be used meaningfully
first_lm <- lm(data = weather_data_final[weather_data_final$Zählstelle == "Meerbusch",], Anzahl ~ air_temp + as.factor(Hour) + as.factor(Weekday) + wind_speed)
summary(first_lm)
## 
## Call:
## lm(formula = Anzahl ~ air_temp + as.factor(Hour) + as.factor(Weekday) + 
##     wind_speed, data = weather_data_final[weather_data_final$Zählstelle == 
##     "Meerbusch", ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -92.17 -15.77  -2.99  10.56 430.38 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -16.93149    0.66844 -25.330  < 2e-16 ***
## air_temp                1.99640    0.01639 121.787  < 2e-16 ***
## as.factor(Hour)1        0.46875    0.78555   0.597 0.550702    
## as.factor(Hour)2        1.05772    0.78623   1.345 0.178533    
## as.factor(Hour)3        3.18931    0.78571   4.059 4.93e-05 ***
## as.factor(Hour)4        9.51088    0.78567  12.105  < 2e-16 ***
## as.factor(Hour)5       21.04927    0.78531  26.804  < 2e-16 ***
## as.factor(Hour)6       26.01314    0.78533  33.124  < 2e-16 ***
## as.factor(Hour)7       22.99559    0.78580  29.264  < 2e-16 ***
## as.factor(Hour)8       24.83627    0.78674  31.569  < 2e-16 ***
## as.factor(Hour)9       32.87642    0.78810  41.716  < 2e-16 ***
## as.factor(Hour)10      39.17158    0.79009  49.578  < 2e-16 ***
## as.factor(Hour)11      41.86622    0.79212  52.853  < 2e-16 ***
## as.factor(Hour)12      44.80531    0.79372  56.450  < 2e-16 ***
## as.factor(Hour)13      48.14569    0.79422  60.620  < 2e-16 ***
## as.factor(Hour)14      50.55882    0.79449  63.637  < 2e-16 ***
## as.factor(Hour)15      50.11603    0.79375  63.138  < 2e-16 ***
## as.factor(Hour)16      43.56313    0.79215  54.994  < 2e-16 ***
## as.factor(Hour)17      28.60312    0.79045  36.186  < 2e-16 ***
## as.factor(Hour)18      13.76862    0.78843  17.463  < 2e-16 ***
## as.factor(Hour)19       4.78845    0.78678   6.086 1.16e-09 ***
## as.factor(Hour)20       0.52952    0.78584   0.674 0.500425    
## as.factor(Hour)21      -0.55737    0.78535  -0.710 0.477886    
## as.factor(Hour)22      -0.51252    0.78525  -0.653 0.513965    
## as.factor(Hour)23      -0.39521    0.78528  -0.503 0.614772    
## as.factor(Weekday)Mon   1.50457    0.42404   3.548 0.000388 ***
## as.factor(Weekday)Sat   7.13426    0.42359  16.842  < 2e-16 ***
## as.factor(Weekday)Sun  18.03232    0.42393  42.536  < 2e-16 ***
## as.factor(Weekday)Thu   1.80446    0.42359   4.260 2.05e-05 ***
## as.factor(Weekday)Tue   1.53088    0.42315   3.618 0.000297 ***
## as.factor(Weekday)Wed   2.88904    0.42324   6.826 8.81e-12 ***
## wind_speed             -2.21261    0.05039 -43.911  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.05 on 65786 degrees of freedom
##   (126 observations deleted due to missingness)
## Multiple R-squared:  0.4552, Adjusted R-squared:  0.455 
## F-statistic:  1773 on 31 and 65786 DF,  p-value: < 2.2e-16
unique(weather_data_final$Zählstelle)
## [1] "Dormagen"     "Jüchen"       "Grevenbroich" "Neuss"        "Meerbusch"
# Which is the busiest station?
aggregate(weather_data_final, Anzahl ~ Zählstelle, sum)
##     Zählstelle  Anzahl
## 1     Dormagen  346425
## 2 Grevenbroich  388756
## 3       Jüchen  609886
## 4    Meerbusch 1606697
## 5        Neuss 1059233
# Strong wind -> correlated with some wind direction?
plot(aggregate(weather_data_final, wind_highest_last_h ~ wind_degree, mean),
     main = 'Relation between wind direction and wind speed')

colSums(is.na(weather_data_final))
##           Date_Hour          visibility        air_pressure        rel_humidity 
##                   0                 290                 922                 922 
##            air_temp   sunshine_duration         precip_h_mm         precip_type 
##                 257               85525                 426               94200 
## wind_highest_last_h          wind_speed         wind_degree            tmp_soil 
##                1008                 360                 360              338403 
##         cloud_cover          Zählstelle                Date                Hour 
##                 313                   0                   0                   0 
##             Weekday              Anzahl 
##                   0                   0
# Look whether the NAs in sunshine duration come from reporting NA instead of 
# 0 at night
length(intersect(which(is.na(weather_data_final$sunshine_duration)),
                 which(as.numeric(substr(weather_data_final$Date_Hour, 12, 13)) %in% c(21, 22, 23, 0, 1, 2)) ))
## [1] 70448
# When data is not available and it is at night, then assume 0min of sunshine
weather_data_final$sunshine_duration[intersect(which(is.na(weather_data_final$sunshine_duration)),
                 which(as.numeric(substr(weather_data_final$Date_Hour, 12, 13)) %in% c(21, 22, 23, 0, 1, 2)) )] <- 0


# Plausibility checks
# Yearly mean temperatures
plot(aggregate(weather_data_final, air_temp ~ (substr(Date_Hour, 1, 4)), mean), type = "l",
     main = 'Mean temperatures by year')

# Mean temperature at the days of the year
plot(aggregate(weather_data_final, air_temp ~ as.Date(substr(Date_Hour, 6, 10), format = "%m-%d"), mean), type = "l",
     main = 'Mean temperature per calendar days')

# Bike traffic at calendar dates
plot(aggregate(bike_data_aggr, Anzahl ~ as.Date(substr(Date_Hour, 6, 10), format = "%m-%d"), sum),
     main = 'Aggregated bike traffic at calendar dates')

# Assign an English column name for the number of bikes
colnames(weather_data_final) <- gsub("Anzahl", "Bikes_Counted", colnames(weather_data_final))





# ==============================================================================
# More detailed meteorological data on daily aggregation
# ==============================================================================

# More variables are available if we look not only at an hourly, but also at a
# daily aggregation basis.

# csv for structure inspection
# https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/formate_kl.html

# Can be obtained from:
data_subdaily <- read.csv(file = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/kl_10400_00_akt.txt")[1,]
# Visually, this is a mess!

# Automatic column assignment
# Use the HTML meta-file for quicker data identification -- the position within
# the string is required. The structure of the HTML file allows for quick
# identification of the string positions of the relevant variables.
library(rvest)
## Warning: package 'rvest' was built under R version 4.4.3
url <- "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/formate_kl.html"
page <- read_html(url)
table <- page |> html_table(fill = TRUE)
table_html <- table[[1]]
colnames(table_html) <- table_html[1,]
table_html <- table_html[-1,]

table_html <- data.frame(Label = table_html$Label, 
                         start_pos = as.numeric(table_html$Pos))
# Find out difference between starting points of variable entries
width_of_col <- diff(table_html$start_pos)
# Define widths of the separated variable entry 'columns'
width_of_col <- c(width_of_col, (nchar(data_subdaily) - sum(width_of_col)))
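# How the width vector splits one fixed-width line -- a sketch with made-up
# widths (the real widths are derived from the DWD meta-file above):
toy_line <- "2023 7 12"
toy_fwf  <- read.fwf(textConnection(toy_line), widths = c(4, 2, 3),
                     col.names = c("JA", "MO", "TA"))
# -> JA = 2023, MO = 7, TA = 12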

# This is fwf (fixed-width format) data:
# one long string per row, which can only be split up by providing the column
# widths. We now know the widths as well as the relevant variables.
data_subdaily <- read.fwf("https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/kl_10393_00_akt.txt",
                 widths = width_of_col, stringsAsFactors = FALSE,
                 col.names = table_html$Label)

#data_subdaily[1, 1:10]
# Create machine-readable dates from the date columns
data_subdaily$Date_Year <- as.Date(paste(data_subdaily$JA, 
                                         sprintf("%02d", data_subdaily$MO), # paste as 2digit
                                         sprintf("%02d", data_subdaily$TA), 
                                         sep = "-"))
# Filter only relevant time period
data_subdaily <- data_subdaily[data_subdaily$Date_Year >= "2015-11-16",]
 

# Select only relevant columns -> cf. https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/subdaily/standard_format/formate_kl.html

subdaily_var_of_int <- c(Date_Year = "Date_Year",
                         SDK = "sunshine_dur_day",
                         E1 = "soilstate_7am",
                         VAK = "type_precip",
                         SHK = "snow_cover_cm")

# Find which are available AND of interest, only keep these
data_subdaily <- data_subdaily[, which(colnames(data_subdaily) %in% names(subdaily_var_of_int))]
# Assign human-interpretable colnames
colnames(data_subdaily) <- subdaily_var_of_int[colnames(data_subdaily)]

# Identify NAs -> here encoded as -99
data_subdaily[data_subdaily == -99] <- NA

# Plausibility analysis: Snow coverage by calendar date
plot(aggregate(data_subdaily, snow_cover_cm ~ as.Date(substr(Date_Year, 6, 10), format = "%m-%d"), mean), type = "l",
     main = 'Plausibility check for snow coverage by calendar date')

# Prepare for hourly merging -> clone each row 24 times and create artificial
# time column
data_subdaily <- data_subdaily[rep(1:nrow(data_subdaily), each = 24),]
data_subdaily$Hour <- rep(sprintf("%02d", 0:23), times = (nrow(data_subdaily) / 24))
data_subdaily$Date_Hour <- paste(data_subdaily$Date_Year, " ", data_subdaily$Hour, ":00:00", sep = "") |> as.POSIXct(format = "%Y-%m-%d %H:%M")
data_subdaily$Date_Year <- NULL; data_subdaily$Hour <- NULL
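# The daily-to-hourly expansion above, sketched on a toy frame:
toy_daily  <- data.frame(Date = as.Date(c("2020-01-01", "2020-01-02")),
                         snow = c(1, 0))
toy_hourly <- toy_daily[rep(1:nrow(toy_daily), each = 24), ]
toy_hourly$Hour <- rep(sprintf("%02d", 0:23), times = nrow(toy_daily))
# -> 48 rows: each daily value repeated for all 24 hours of its day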


# Merge with final data
weather_data_PRELIM <- merge(weather_data_final, data_subdaily, by = "Date_Hour", all.x = TRUE)

# CHECK whether the new df WITHOUT the daily data corresponds exactly
# to the df before to exclude some merging fails
identical(weather_data_final[, intersect(colnames(weather_data_final), colnames(weather_data_PRELIM))], 
          weather_data_PRELIM[, intersect(colnames(weather_data_final), colnames(weather_data_PRELIM))])
## [1] TRUE
# Delete tmp version
weather_data_final <- weather_data_PRELIM; weather_data_PRELIM <- NULL




# =========================================
# Holiday data
# =========================================

# Use webscraping!

# Initiate: Look for years, prepare empty vector
years2check <- unique(substr(weather_data_final$Date_Hour, 1, 4))
holidays <- c()

for(i in 1:length(years2check)) {
  
  # Make use of structured urls
  address2check <- paste("https://www.schulferien.org/Kalender_mit_Ferien/kalender_", years2check[i], "_ferien_Nordrhein_Westfalen.html", sep = "")
  
  page <- read_html(address2check)
  text <- page |> html_nodes(".feiertag_liste_row") |> html_text()
  # The holidays are always stored in that node!
  # For compatibility, 'Mär' needs to be translated to 'Mrz'
  # Make use of the uniform structure and retrieve day, month, year
  # (note: %b is locale-dependent; parsing German month abbreviations
  # requires a German LC_TIME locale, e.g. Sys.setlocale("LC_TIME", "de_DE"))
  dates <- paste(gsub("Mär", "Mrz", substr(text, 16, 22)), substr(text, 38, 41), sep = " ") |> as.Date(format = "%d. %b %Y")
  
  # add to holiday vector
  holidays <- c(holidays, dates)
  
}

# Re-convert from numeric to actual date
holidays <- holidays |> as.Date(origin = "1970-01-01") 



# Do the same for school holidays
school_holidays <- c()

for(i in 1:length(years2check)) {

  address2check <- paste("https://www.schulferien.org/Kalender_mit_Ferien/kalender_", years2check[i], "_ferien_Nordrhein_Westfalen.html", sep = "")

  page <- read_html(address2check)
  
  # holiday data most easily retrieved like this
  text_raw <- page |> html_nodes("td") |> html_attr("data-tip-text")
  text_raw <- text_raw[-which(is.na(text_raw))]
  # Delete superfluous text information
  text_raw <- gsub("</span><br>Nordrhein-Westfalen  <br>", " ", gsub("<span class=\"tmv_tooltip_title\">", "", text_raw))
  
  # Look for structured information which indicates a timespan between two dates
  school_holidays_year <- unlist(regmatches(text_raw, gregexpr(text_raw, pattern = "\\d{2}\\.\\d{2}\\.\\d{4} - \\d{2}\\.\\d{2}\\.\\d{4}"))) |> unique()
  
  hol_in_y <- c()
  
  # Only beginning and end date known until now, we want all days in between as well
  for(j in 1:length(school_holidays_year)) {
    obj <- school_holidays_year[j]
    date1 <- as.Date(substr(obj, 1, 10), format = "%d.%m.%Y")
    date2 <- as.Date(substr(obj, 14, nchar(obj)), format = "%d.%m.%Y")
    holiday_seq <- date1:date2
    hol_in_y <- c(hol_in_y, holiday_seq)
  }
  
  # Store values in the array
  school_holidays <- c(school_holidays, hol_in_y)
  

}

# Again, a machine-readable date is needed.
school_holidays <- school_holidays |> as.Date(origin = "1970-01-01")
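# Side note: date1:date2 above yields a numeric vector, which is why the
# re-conversion via as.Date(origin = ...) is needed. An alternative sketch
# that keeps the Date class throughout would be seq():
span_example <- seq(as.Date("2020-12-23"), as.Date("2021-01-06"), by = "day")
# -> 15 consecutive dates, still of class 'Date'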



# Add to the final data set: dummies are most useful here!
# Check for each date whether it is in the holidays or school_holidays vector
weather_data_final$Holiday <- ifelse(weather_data_final$Date %in% holidays, 1, 0)
# Plausibility check
table(weather_data_final$Holiday)
## 
##      0      1 
## 333891   4512
weather_data_final$School_holidays <- ifelse(weather_data_final$Date %in% school_holidays, 1, 0)
# Plausibility check
table(weather_data_final$School_holidays)
## 
##      0      1 
## 258535  79868
# Final cleaning
# Which columns have more than 5% NA still?
which(colSums(is.na(weather_data_final)) > 0.05 * nrow(weather_data_final))
##   precip_type      tmp_soil snow_cover_cm 
##             8            12            22
# DELETE
unique(weather_data_final$tmp_soil)
## [1] NA
weather_data_final$tmp_soil <- NULL

head(weather_data_final$precip_type, 20) 
##  [1]  6 NA  6  6 NA  6  6 NA  6  6 NA  6  0 NA  0  6 NA  0  0 NA
# Every third element seems to be missing. Check that:
aggregate(is.na(weather_data_final$precip_type), by = list(rep(1:3, length.out = nrow(weather_data_final)),
                                                           weather_data_final$Zählstelle),
          FUN = sum)
##    Group.1      Group.2    x
## 1        1     Dormagen 6660
## 2        2     Dormagen 6443
## 3        3     Dormagen 6218
## 4        1 Grevenbroich 6417
## 5        2 Grevenbroich 6460
## 6        3 Grevenbroich 6251
## 7        1       Jüchen 6470
## 8        2       Jüchen 6409
## 9        3       Jüchen 6245
## 10       1    Meerbusch 5828
## 11       2    Meerbusch 6019
## 12       3    Meerbusch 5725
## 13       1        Neuss 6398
## 14       2        Neuss 6620
## 15       3        Neuss 6037
# About a third of the values are missing.
sum(is.na(weather_data_final$precip_type)) / nrow(weather_data_final)
## [1] 0.2783663
# Why is that? There is a reason:
# cf. rmd/tmp/stundenwerte_RR/Metadaten_Parameter_rr_stunde_01078.html:
# not every hour is reported! Some are missing, e.g. 3, 6, 9 ...
# For simplicity: just assume the value from the previous hour. (This might not
# always be correct, but omitting roughly every third row would not be adequate
# in this setting.)
# "Forward filling"
for(k in 1:nrow(weather_data_final)) {
  if(is.na(weather_data_final$precip_type[k])) {
    weather_data_final$precip_type[k] <- weather_data_final$precip_type[k - 1]
  }
}
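# The row-by-row loop above works because the first precip_type value is not
# NA, but it is slow and would fail on a leading NA. A vectorized forward-fill
# sketch in base R:
locf <- function(x) {
  idx <- cumsum(!is.na(x))          # index of the last non-NA value seen so far
  c(NA, x[!is.na(x)])[idx + 1]      # leading NAs (idx == 0) stay NA
}
# locf(c(6, NA, 6, 0, NA)) -> 6 6 6 0 0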

# Look how many still NA
sum(is.na(weather_data_final$precip_type))
## [1] 0
# Snow cover: missing
length(which(is.na(weather_data_final$snow_cover_cm)))
## [1] 35134
summer_and_no_snow <- intersect(which(is.na(weather_data_final$snow_cover_cm)),
          which(as.double(substr(weather_data_final$Date, 6, 7)) %in% 4:10)) 
length(summer_and_no_snow)
## [1] 35038
# For those days which are in summer and have NA for snow coverage -> Assume 0
weather_data_final$snow_cover_cm[summer_and_no_snow] <- 0


(colSums(is.na(weather_data_final)) / nrow(weather_data_final) * 100)  |> sort(decreasing = T)
##   sunshine_duration wind_highest_last_h        air_pressure        rel_humidity 
##          4.45533875          0.29786970          0.27245621          0.27245621 
##         precip_h_mm          wind_speed         wind_degree         cloud_cover 
##          0.12588541          0.10638204          0.10638204          0.09249327 
##          visibility            air_temp       snow_cover_cm           Date_Hour 
##          0.08569664          0.07594495          0.02836854          0.00000000 
##         precip_type          Zählstelle                Date                Hour 
##          0.00000000          0.00000000          0.00000000          0.00000000 
##             Weekday       Bikes_Counted    sunshine_dur_day       soilstate_7am 
##          0.00000000          0.00000000          0.00000000          0.00000000 
##         type_precip             Holiday     School_holidays 
##          0.00000000          0.00000000          0.00000000
# After this: only a moderate share of NAs remains in each column (at most
# ~0.3%, apart from sunshine_duration at ~4.5%), which we neglect



# Look whether data-specific NAs still exist that were not accounted for yet,
# encoded with e.g. -9, -99 or -999 -> NAs could still be encoded as negatives!
# Check for every column how many values carry the DWD NA codes
colSums(weather_data_final == -9 | weather_data_final == -99 | weather_data_final == -999) |>  sort(decreasing = T)
##    soilstate_7am      type_precip        Date_Hour      precip_type 
##           152587           152587                0                0 
##       Zählstelle             Date             Hour          Weekday 
##                0                0                0                0 
##    Bikes_Counted sunshine_dur_day          Holiday  School_holidays 
##                0                0                0                0
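# If remaining sentinel codes were to be replaced in one pass, this pattern
# would work (a sketch on a toy frame; for soilstate_7am and type_precip the
# decision is taken further below):
toy <- data.frame(a = c(1, -9, 3), b = c("x", "y", "z"))
sentinels <- c(-9, -99, -999)
toy[] <- lapply(toy, function(col) replace(col, col %in% sentinels, NA))
# -> toy$a becomes 1, NA, 3; the character column b is untouched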
# Check at discrete variable outcomes
table(weather_data_final$soilstate_7am)
## 
##     -9      0      1      2      4      5      7      8 
## 152587  39000 112957   5325  17302    360   8688   2184
table(weather_data_final$type_precip)
## 
##     -9      0      1      2      3      5      6      7      8     10     15 
## 152587  72223  92504   3959   1320   4920   2472    360   2208   2730    120 
##     21     25     26     30 
##   1128    696    480    696
# The non-reporting of the soil state could be season-related
aggregate(weather_data_final$soilstate_7am == -9, by = list(substr(weather_data_final$Date, 6, 7)), FUN = sum)
##    Group.1     x
## 1       01 10007
## 2       02  8736
## 3       03 11065
## 4       04 13550
## 5       05 14136
## 6       06 13680
## 7       07 14136
## 8       08 14083
## 9       09 13678
## 10      10 13628
## 11      11 13161
## 12      12 12727
# It is not! Maybe year...
aggregate(weather_data_final$soilstate_7am == -9, by = list(substr(weather_data_final$Date, 1, 4)), FUN = sum)
##   Group.1     x
## 1    2015     0
## 2    2016     0
## 3    2017     0
## 4    2018     0
## 5    2019     0
## 6    2020 33856
## 7    2021 41817
## 8    2022 43790
## 9    2023 33124
# Visually:
plot(weather_data_final$Date_Hour, ifelse(weather_data_final$soilstate_7am == -9, 1, 0),
     main = 'Check of data availability of soilstate')

# Apparently, the soil state is not reported any more from 2020 onwards => what to do?

# SAME check for type_precip...
# Not-reporting could be season-related
aggregate(weather_data_final$type_precip == -9, by = list(substr(weather_data_final$Date, 6, 7)), FUN = sum)
##    Group.1     x
## 1       01 10007
## 2       02  8736
## 3       03 11065
## 4       04 13550
## 5       05 14136
## 6       06 13680
## 7       07 14136
## 8       08 14083
## 9       09 13678
## 10      10 13628
## 11      11 13161
## 12      12 12727
# It is not! Maybe year...
aggregate(weather_data_final$type_precip == -9, by = list(substr(weather_data_final$Date, 1, 4)), FUN = sum)
##   Group.1     x
## 1    2015     0
## 2    2016     0
## 3    2017     0
## 4    2018     0
## 5    2019     0
## 6    2020 33856
## 7    2021 41817
## 8    2022 43790
## 9    2023 33124
plot(weather_data_final$Date_Hour, ifelse(weather_data_final$type_precip == -9, 1, 0),
     main = 'Check for data availability of precipitation type')

# Apparently, precipitation type is not reported any more from 2020 on => what to do?

# DECISION TO TAKE: either drop these variables, or subset to the years in which they are available (the first would be more reasonable)
# => NOTE: Final drop will be performed later
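
Whichever option is chosen, the DWD missing-value code -9 should eventually become a proper `NA` so that `na.omit()` and summary functions treat it correctly. A minimal sketch (the helper name `recode_dwd_missing` is hypothetical, not part of this script):

```r
# Sketch (assumption): recode the DWD missing-value code -9 to NA in every
# numeric column, so that na.omit() and summary functions handle it properly
recode_dwd_missing <- function(df, code = -9) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(x) replace(x, x == code, NA))
  df
}
# e.g. weather_data_final <- recode_dwd_missing(weather_data_final)
```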



# Plausibility checks again: mean of Bikes_Counted per weekday
aggregate(weather_data_final, Bikes_Counted ~ Weekday, mean)
##   Weekday Bikes_Counted
## 1     Fri      9.508385
## 2     Mon      9.750259
## 3     Sat     13.292817
## 4     Sun     20.246789
## 5     Thu     10.163430
## 6     Tue      9.612277
## 7     Wed     10.388600
plot(aggregate(weather_data_final, air_temp ~ lubridate::month(weather_data_final$Date_Hour), mean))

# Check whether there is a general difference in bike traffic intensity across stations
aggregate(weather_data_final, Bikes_Counted ~ Zählstelle, mean)
##     Zählstelle Bikes_Counted
## 1     Dormagen      4.886107
## 2 Grevenbroich      5.515678
## 3       Jüchen      8.847134
## 4    Meerbusch     24.364567
## 5        Neuss     17.045638
# Check for perfect collinearity -> we need to avoid this
weather_data_final |> dplyr::select(where(is.numeric)) |> na.omit() |> cor() |> 
  heatmap(Rowv = NA, Colv = NA)
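
Beyond the visual heatmap, the strongest correlations can also be listed programmatically; a small sketch (the 0.9 cutoff is an arbitrary illustrative choice, not taken from the analysis):

```r
# Sketch: list variable pairs whose absolute correlation exceeds a cutoff
cormat <- weather_data_final |> dplyr::select(where(is.numeric)) |> na.omit() |> cor()
high <- which(abs(cormat) > 0.9 & upper.tri(cormat), arr.ind = TRUE)
data.frame(var1 = rownames(cormat)[high[, 1]],
           var2 = colnames(cormat)[high[, 2]],
           cor  = cormat[high])
```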

# Aggregate all stations (check for differing availability, timespan when all are available)
# Instead of running 5 different models or some station dummy, we will predict more on
# the overall bike traffic in this district.
# Therefore, we aggregate ALL stations for the point in time when ALL are available,
# such that there are no skewed sums due to structural unavailabilities.
mins <- c()
for(station in unique(weather_data_final$Zählstelle)) {
  newmin <- min((weather_data_final |> subset(Zählstelle == station))$Date_Hour)
  mins <- mins |> c(newmin)
}

# Find the latest-starting station
max(mins)
## [1] 1453759200
maxs <- c()
for(station in unique(weather_data_final$Zählstelle)) {
  newmax <- max((weather_data_final |> subset(Zählstelle == station))$Date_Hour)
  maxs <- maxs |> c(newmax)
}
# Find the earliest drop-out station
min(maxs)
## [1] 1673730000
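
As a side note, the two per-station loops above could equivalently be written in one line each with `tapply` (which simplifies the result to the numeric Unix-time representation, as the printed output of the loops effectively shows as well):

```r
# Sketch: equivalent one-liners for the per-station minima/maxima
station_min <- tapply(weather_data_final$Date_Hour, weather_data_final$Zählstelle, min)
station_max <- tapply(weather_data_final$Date_Hour, weather_data_final$Zählstelle, max)
max(station_min)  # latest-starting station
min(station_max)  # earliest drop-out
```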
# Check how many stations available at every point in time
table_availability <- aggregate(weather_data_final, Zählstelle ~ Date_Hour, FUN = function (x) length(unique(x)))

# Find the Hours and Dates at which ALL STATIONS are available!
allstationsavailable_time_hour <- table_availability$Date_Hour[which(table_availability$Zählstelle == max(table_availability$Zählstelle))]

# Earliest and latest common date
allstationsavailable_time_hour[1]; allstationsavailable_time_hour[length(allstationsavailable_time_hour)]
## [1] "2016-01-25 23:00:00 CET"
## [1] "2023-01-14 22:00:00 CET"
# ONLY keep dates where all are available!
weather_data_final <- weather_data_final[which(weather_data_final$Date_Hour %in% allstationsavailable_time_hour),]
# roughly 50,000 obs lost... But still more than enough :)



# NOW: split the data up again into weather data and bike observations, and re-merge them later.
# (The decision to sum up all stations was only taken at this point, and we did not want to
# restructure the script from the beginning, with the potential code problems such a change would bring.)

bikes_only <- weather_data_final[, c("Date_Hour", "Zählstelle", "Bikes_Counted")]
weather_only <- weather_data_final[, c("Date_Hour", setdiff(colnames(weather_data_final), colnames(bikes_only)))]

# Aggregate the bike traffic from all stations for every hour of common availability
bikes_aggr <- aggregate(bikes_only, Bikes_Counted ~ Date_Hour, sum)
# Drop duplicates -> every hour needs to be documented only once
# All hourly data for one particular hour are redundant anyway
weather_only <- unique(weather_only)

# As expected, aggregating the bike counts and deleting the duplicate rows
# in the weather data lead to the same number of rows:
# ONE row for the weather at each date/hour combination,
# and ONE row for the bike traffic at each.
nrow(bikes_aggr) == nrow(weather_only)
## [1] TRUE
# We can now neatly bring the two together.
data_final_aggr <- merge(bikes_aggr, weather_only, by = "Date_Hour")



# Decision on deletion of variables with little availability
length(which(data_final_aggr$soilstate_7am == -9)) / nrow(data_final_aggr)
## [1] 0.3851679
length(which(data_final_aggr$type_precip == -9)) / nrow(data_final_aggr)
## [1] 0.3851679
# Very high unavailability... This is a problem.
# Do these variables explain anything, looked at all alone?
summary(lm(data_final_aggr, 
           formula = Bikes_Counted ~ as.factor(soilstate_7am) + as.factor(type_precip))
        )$r.squared
## [1] 0.04306952
# Either delete 38% of the data, or delete two variables which jointly explain only 4% of the variation?
# It is more reasonable to delete the variables.
data_final_aggr$soilstate_7am <- NULL
data_final_aggr$type_precip <- NULL

# How many NA left? Final check...
nrow(data_final_aggr) - nrow(na.omit(data_final_aggr))
## [1] 2800
sum(na.omit(data_final_aggr) == -9)
## [1] 0
# Only 2,800 rows will be deleted + no coded NAs any more.
# This is good! Now we can just delete the NA rows and have a practically fully
# available dataset
data_final_aggr <- data_final_aggr |> na.omit()

# Delete useless info for final data set:
colnames(data_final_aggr)
##  [1] "Date_Hour"           "Bikes_Counted"       "visibility"         
##  [4] "air_pressure"        "rel_humidity"        "air_temp"           
##  [7] "sunshine_duration"   "precip_h_mm"         "precip_type"        
## [10] "wind_highest_last_h" "wind_speed"          "wind_degree"        
## [13] "cloud_cover"         "Date"                "Hour"               
## [16] "Weekday"             "sunshine_dur_day"    "snow_cover_cm"      
## [19] "Holiday"             "School_holidays"
# Can be re-constructed if needed later
data_final_aggr$Date <- NULL


# ====================================================================
# Reformatting
# ====================================================================

# Now, we want to change the variables to their most common/understandable
# units and assign them their most adequate type.

#class(data_final_aggr$Holiday)

# Visibility in km instead of m
data_final_aggr$visibility <- data_final_aggr$visibility / 1000

unique(data_final_aggr$precip_type)
## [1] 0 6 8 7 2 4
# Wind direction -- 8-fold separation instead of degree as numeric
# We do not want 36 different dummies either
unique(data_final_aggr$wind_degree) |> sort()
##  [1]   0  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180
## [20] 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360
wind_degr_dir <- data.frame(degr = seq(0, 360, by = 10))
wind_degr_dir$dir <- c(rep("N", 3), 
                       rep("NE", 4),
                       rep("E", 5),
                       rep("SE", 4),
                       rep("S", 5),
                       rep("SW", 4), 
                       rep("W", 5),
                       rep("NW", 4),
                       rep("N", 3))
# Assigns the right verbal description (e.g. NE = North-East and so on)
winddir_assign <- array(wind_degr_dir$dir, dimnames = list(wind_degr_dir$degr))

data_final_aggr$wind_dir <- winddir_assign[as.character(data_final_aggr$wind_degree)]
# check if assignment successful:
unique(data_final_aggr[, c("wind_dir", "wind_degree")])
##     wind_dir wind_degree
## 1          S         180
## 3         SW         220
## 4         SW         240
## 5         SW         210
## 12         S         200
## 14         S         190
## 16        SW         230
## 53         W         280
## 54        NW         320
## 55         W         260
## 57         W         290
## 58         W         250
## 234       NW         300
## 235       NW         330
## 236        N         340
## 237        N          20
## 238        N         360
## 245        S         170
## 247       SE         140
## 248       SE         150
## 251        S         160
## 350       SE         130
## 356        W         270
## 425       SE         120
## 428        E          80
## 429       NE          60
## 433       NE          50
## 435        N           0
## 441        E         110
## 442        E          70
## 451        E         100
## 456       NE          40
## 484        N         350
## 495        N          10
## 518       NE          30
## 526        E          90
## 548       NW         310
# Worked well
data_final_aggr$wind_degree <- NULL
data_final_aggr$wind_dir <- data_final_aggr$wind_dir |> as.factor()
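
As a design note, the same eight-sector mapping could be written without the lookup table by shifting the degrees by half a sector and binning with `cut()`; a sketch (the function name is hypothetical):

```r
# Sketch: shift by half a sector (22.5 degrees) so that 340-360 and 0-20 degrees
# both land in "N", then bin into eight 45-degree sectors
degrees_to_sector <- function(deg) {
  cut((deg + 22.5) %% 360, breaks = seq(0, 360, by = 45),
      labels = c("N", "NE", "E", "SE", "S", "SW", "W", "NW"),
      include.lowest = TRUE)
}
degrees_to_sector(c(0, 30, 90, 200, 350))  # N NE E S N
```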


# precip_type
class(data_final_aggr$precip_type) # needs to be treated as factor
## [1] "integer"
data_final_aggr$precip_type <- data_final_aggr$precip_type |> as.factor()

# cloud_cover
class(data_final_aggr$cloud_cover) # makes more sense in dummy interpretation
## [1] "integer"
data_final_aggr$cloud_cover <- data_final_aggr$cloud_cover |> as.factor()

# hour dummy
class(data_final_aggr$Hour) # also more sense as factor, because no linear trend throughout day
## [1] "numeric"
data_final_aggr$Hour <- as.factor(data_final_aggr$Hour)

# weekday
class(data_final_aggr$Weekday) # transform to factor
## [1] "character"
data_final_aggr$Weekday <- data_final_aggr$Weekday |> as.factor()

# make holiday dummies factors as well
data_final_aggr$Holiday <- data_final_aggr$Holiday |> as.factor()
data_final_aggr$School_holidays <- data_final_aggr$School_holidays |> as.factor()

# Two times sun cover: once on hourly base, and on daily base. check whether identical...
# aggregate(data_final_aggr, sunshine_dur_day ~ substr(Date_Hour, 1, 10), FUN = mean)[1:20]
# aggregate(data_final_aggr, sunshine_duration ~ substr(Date_Hour, 1, 10), FUN = sum)[1:20]
c(min(data_final_aggr$sunshine_dur_day), max(data_final_aggr$sunshine_dur_day))
## [1]   0 160
unique(data_final_aggr$sunshine_dur_day) |> head(50)
##  [1]   0  33  50  14  41  39  47  18  40  53  11  26   1  36  84   3  23  16  21
## [20]  30  74  59   4  19  86  10  48  72   2  90 101 114  58   5  12   8  88  81
## [39]  89 122  87  35  85  71  42  96  34   7  44  54
# No point in keeping the second variable: delete it
data_final_aggr$sunshine_dur_day <- NULL

# Plausibility analysis for the other sunshine duration variable
sunsh_per_day <- aggregate(data_final_aggr, sunshine_duration ~ substr(Date_Hour, 1, 10), FUN = sum)
plot(1:366, (aggregate(sunsh_per_day, sunshine_duration ~ substr(`substr(Date_Hour, 1, 10)`, 6, 10), max)[,2])/60,
     main = "Plausibility check: Sunshine duration per cal. date")

# YES, naturally limited seasonality pattern recognizable.

# Inspect precip_type, because 2 is not specified in online source
# cf. https://www.dwd.de/DE/leistungen/klimadatendeutschland/beschreibung_tagesmonatswerte.html
# Tabelle NIEDERSCHLAGSHOEHE_IND (30 May 2025)
plot(aggregate(data_final_aggr, precip_h_mm ~ precip_type, mean))

table(data_final_aggr$precip_type) # code 2 unimportant, only 6 occ.
## 
##     0     2     4     6     7     8 
## 42852     6    58 11168   197   281
# Check whether types are as needed:
sapply(data_final_aggr, class)
## $Date_Hour
## [1] "POSIXct" "POSIXt" 
## 
## $Bikes_Counted
## [1] "integer"
## 
## $visibility
## [1] "numeric"
## 
## $air_pressure
## [1] "numeric"
## 
## $rel_humidity
## [1] "numeric"
## 
## $air_temp
## [1] "numeric"
## 
## $sunshine_duration
## [1] "numeric"
## 
## $precip_h_mm
## [1] "numeric"
## 
## $precip_type
## [1] "factor"
## 
## $wind_highest_last_h
## [1] "numeric"
## 
## $wind_speed
## [1] "numeric"
## 
## $cloud_cover
## [1] "factor"
## 
## $Hour
## [1] "factor"
## 
## $Weekday
## [1] "factor"
## 
## $snow_cover_cm
## [1] "numeric"
## 
## $Holiday
## [1] "factor"
## 
## $School_holidays
## [1] "factor"
## 
## $wind_dir
## [1] "factor"
# Cloud cover still has a negative value... is this NA?
unique(data_final_aggr$cloud_cover) |> sort() |> head(100)
##  [1] -1 0  1  2  3  4  5  6  7  8 
## Levels: -1 0 1 2 3 4 5 6 7 8
length(which(data_final_aggr$cloud_cover == -1))
## [1] 43
# cloud cover => inspect -1

# Identify data as NA and clear df from them thereafter.
data_final_aggr$cloud_cover[which(data_final_aggr$cloud_cover == -1)] <- NA
data_final_aggr <- data_final_aggr |> na.omit()


# Create dummies for months as well
data_final_aggr$Month <- as.factor(substr(data_final_aggr$Date_Hour, 6, 7))


# Recall once more: not the entire period has data available!
# This means there will be gaps in the data the model is built on.
plot(table_availability$Date_Hour, table_availability$Zählstelle,# type = 'l',
     col = ifelse(table_availability$Zählstelle == 5, 'darkgreen', 'grey'),
     main = "Data availability", sub = "For green dots, all 5 stations reported")

# FINAL
format(object.size(data_final_aggr), units = "MB") # check for size
## [1] "6.3 Mb"
saveRDS(data_final_aggr, file = paste(wd, "data/final/data_final_bike_weather.RDS", sep = "/"))

# CLEAR environment to continue with only the relevant data set
rm(list = setdiff(ls(), "data_final_aggr"))
# } #end of if process raw data.

Further preparation for more complex models

A test split is used for the comparison metrics at the end of every model.

df <- data_final_aggr
names(df) # Print variables
##  [1] "Date_Hour"           "Bikes_Counted"       "visibility"         
##  [4] "air_pressure"        "rel_humidity"        "air_temp"           
##  [7] "sunshine_duration"   "precip_h_mm"         "precip_type"        
## [10] "wind_highest_last_h" "wind_speed"          "cloud_cover"        
## [13] "Hour"                "Weekday"             "snow_cover_cm"      
## [16] "Holiday"             "School_holidays"     "wind_dir"           
## [19] "Month"
###################################################
# Remove part of the data during testing to limit run time

df <- df[sample(nrow(df), size = floor(0.100 * nrow(df))), ]  # influences model run time a lot
# Set to 100% of the data for the final rendering

###################################################

set.seed(123)  # for reproducibility
n <- nrow(df)
train_index <- sample(seq_len(n), size = 0.8 * n)

train <- df[train_index, ]
test  <- df[-train_index, ]

Models 1 & 2

  • Use at least two different models to investigate your regression or classification problem. Both models receive the same input variables.
  • If you tune your model/s for some metric, only use the final tuned version of your model in the comparison.
  • Evaluate and compare your models based on a reasonable evaluation metric of your choice. You must use the same metric for both models. Report both the training and the CV loss.

Linear Model

Simplistic model

First lm analysis: a model on all variables in the data set, except for Date_Hour

lm_exp <- lm(data_final_aggr, formula = Bikes_Counted ~ . - Date_Hour)
lm_exp |> summary()
## 
## Call:
## lm(formula = Bikes_Counted ~ . - Date_Hour, data = data_final_aggr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -265.92  -36.16   -2.93   24.98 1102.16 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -398.36544   41.42035  -9.618  < 2e-16 ***
## visibility            -0.16207    0.02418  -6.702 2.08e-11 ***
## air_pressure           0.43533    0.03971  10.962  < 2e-16 ***
## rel_humidity          -0.78779    0.03642 -21.630  < 2e-16 ***
## air_temp               3.89467    0.09595  40.591  < 2e-16 ***
## sunshine_duration      1.36000    0.02335  58.254  < 2e-16 ***
## precip_h_mm           -1.12857    0.66552  -1.696 0.089935 .  
## precip_type2         -42.96897   36.56421  -1.175 0.239934    
## precip_type4         -17.01243    9.63589  -1.766 0.077481 .  
## precip_type6         -11.92730    0.99994 -11.928  < 2e-16 ***
## precip_type7          10.69082    5.34677   1.999 0.045560 *  
## precip_type8           5.10923    4.49564   1.136 0.255758    
## wind_highest_last_h    0.55203    0.26502   2.083 0.037260 *  
## wind_speed            -4.68807    0.40310 -11.630  < 2e-16 ***
## cloud_cover1          -4.07176    1.52886  -2.663 0.007741 ** 
## cloud_cover2           1.55867    1.70823   0.912 0.361536    
## cloud_cover3          -3.56006    1.76602  -2.016 0.043819 *  
## cloud_cover4          -5.10305    1.83424  -2.782 0.005403 ** 
## cloud_cover5          -2.59075    1.78590  -1.451 0.146877    
## cloud_cover6          -1.46199    1.64113  -0.891 0.373018    
## cloud_cover7          -3.68954    1.38363  -2.667 0.007665 ** 
## cloud_cover8          -2.75783    1.48879  -1.852 0.063975 .  
## Hour2                  2.27745    2.12311   1.073 0.283414    
## Hour3                  6.23293    2.12382   2.935 0.003339 ** 
## Hour4                 17.26044    2.12648   8.117 4.88e-16 ***
## Hour5                 28.76377    2.13208  13.491  < 2e-16 ***
## Hour6                 26.27472    2.14530  12.248  < 2e-16 ***
## Hour7                 20.91949    2.16238   9.674  < 2e-16 ***
## Hour8                 33.27828    2.18028  15.263  < 2e-16 ***
## Hour9                 55.42019    2.20487  25.135  < 2e-16 ***
## Hour10                69.31265    2.22275  31.183  < 2e-16 ***
## Hour11                75.93243    2.23874  33.917  < 2e-16 ***
## Hour12                87.43285    2.25154  38.832  < 2e-16 ***
## Hour13                94.30070    2.26033  41.720  < 2e-16 ***
## Hour14                90.20336    2.26427  39.838  < 2e-16 ***
## Hour15                76.85222    2.25732  34.046  < 2e-16 ***
## Hour16                51.40162    2.23939  22.953  < 2e-16 ***
## Hour17                22.92969    2.21399  10.357  < 2e-16 ***
## Hour18                 0.15166    2.18812   0.069 0.944744    
## Hour19                -9.25724    2.16377  -4.278 1.89e-05 ***
## Hour20                -7.70772    2.14761  -3.589 0.000332 ***
## Hour21                -6.44862    2.13384  -3.022 0.002512 ** 
## Hour22                -4.95042    2.12671  -2.328 0.019930 *  
## Hour23                -3.15770    2.12332  -1.487 0.136980    
## WeekdayMon            -0.01278    1.17585  -0.011 0.991326    
## WeekdaySat            21.65892    1.17080  18.499  < 2e-16 ***
## WeekdaySun            60.16976    1.17372  51.264  < 2e-16 ***
## WeekdayThu             3.33215    1.17152   2.844 0.004453 ** 
## WeekdayTue             0.57126    1.17088   0.488 0.625629    
## WeekdayWed             3.11504    1.17357   2.654 0.007949 ** 
## snow_cover_cm          0.01569    0.62413   0.025 0.979940    
## Holiday1              59.93505    2.75290  21.772  < 2e-16 ***
## School_holidays1      -0.95914    0.92446  -1.038 0.299500    
## wind_dirN             -0.56714    1.59339  -0.356 0.721893    
## wind_dirNE            -5.44559    1.60769  -3.387 0.000707 ***
## wind_dirNW            -3.63588    1.74741  -2.081 0.037463 *  
## wind_dirS              0.89384    1.54368   0.579 0.562568    
## wind_dirSE            -3.38121    1.59239  -2.123 0.033728 *  
## wind_dirSW             3.43129    1.54751   2.217 0.026607 *  
## wind_dirW             -1.84920    1.58038  -1.170 0.241967    
## Month02               -5.00103    1.77048  -2.825 0.004735 ** 
## Month03                1.42147    1.76866   0.804 0.421574    
## Month04                7.67926    1.81994   4.220 2.45e-05 ***
## Month05               -1.06303    1.93444  -0.550 0.582647    
## Month06              -15.14071    2.14612  -7.055 1.75e-12 ***
## Month07              -12.86941    2.20352  -5.840 5.24e-09 ***
## Month08              -10.58036    2.19436  -4.822 1.43e-06 ***
## Month09               -4.41726    2.01266  -2.195 0.028187 *  
## Month10               -0.36999    1.82601  -0.203 0.839431    
## Month11                2.43963    1.72678   1.413 0.157714    
## Month12                1.72245    1.72095   1.001 0.316894    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 73.05 on 54448 degrees of freedom
## Multiple R-squared:  0.5599, Adjusted R-squared:  0.5594 
## F-statistic: 989.7 on 70 and 54448 DF,  p-value: < 2.2e-16
lm_exp_pred <- data.frame(Date = data_final_aggr$Date_Hour,
                          Actual = data_final_aggr$Bikes_Counted,
                          Pred = predict(lm_exp, data_final_aggr))

# Note that negative predictions are possible
min(lm_exp_pred$Pred)
## [1] -103.7233
# any(grepl("2021-05", data_final_aggr$Date_Hour))
# data_final_aggr[which(grepl("2021-05", data_final_aggr$Date_Hour)),]


library(ggplot2)

# Get some samples for plotting the model predictions
set.seed(06124); samples <- sample(1:(nrow(lm_exp_pred) - 150), 20)

for (i in 1:length(samples)) {
  
  plot <- ggplot(data = lm_exp_pred[samples[i]:(samples[i] + 150),], aes(x = Date, y = Actual), alpha = 0.7) + 
    geom_line() +
    geom_line(aes(y = Pred), color = "red") +
    ggtitle("Examples: Actual vs. prediction")
  
  print(plot)
}

# Peaks are not so well identified
# Overprediction in winter!
ggplot(data = lm_exp_pred, aes(x = Date, y = Actual), alpha = 0.7)  + geom_line() +
  ggtitle("Actual bike traffic through the time")

# Periods with data:
# Note that when not all stations are available, then these dates are not in the sample
# due to our decisions during aggregation. Only points in time that were green in the Data availability 
# plot are included!
# Negative values still possible.
# See how fit increases when negative values are all set to zero:
lm_exp_pred$Pred[which(lm_exp_pred$Pred < 0)] <- 0
new_rsq <- 1 - ( sum((lm_exp_pred$Pred - lm_exp_pred$Actual)**2) / sum((lm_exp_pred$Actual - mean(lm_exp_pred$Actual))**2) ); print(new_rsq) # only slightly better than before.
## [1] 0.5793955

Now comes the feature-engineering stage: thinking about possible interaction terms, squares, logs, etc.

# Find out dates with highest mispredictions to find out what might be relevant
lm_exp_pred$eps_sq <- (lm_exp_pred$Actual - lm_exp_pred$Pred) ** 2
lm_exp_pred$eps <- (lm_exp_pred$Actual - lm_exp_pred$Pred)
lm_exp_pred$abs_error <- abs(lm_exp_pred$Actual - lm_exp_pred$Pred) 

pred_error_tbl <- aggregate(lm_exp_pred, eps ~ substr(Date, 6, 10), FUN = mean)
pred_error_tbl$`substr(Date, 6, 10)` <- as.Date(pred_error_tbl$`substr(Date, 6, 10)`, format = "%m-%d")
colnames(pred_error_tbl)[1] <- "Date"
plot(pred_error_tbl$Date, pred_error_tbl$eps, main = "Mean eps", sub = "Attention! Absolute eps can be misleading")

No clear seasonal misprediction pattern.

Interaction terms

To find interesting combinations, I first run an lm of Bikes_Counted on each possible pair of two variables without their interaction, and then add the interaction term. If the R² of the latter model is substantially higher, this speaks for an important interaction term.

# Create interaction terms

colnames(data_final_aggr)
##  [1] "Date_Hour"           "Bikes_Counted"       "visibility"         
##  [4] "air_pressure"        "rel_humidity"        "air_temp"           
##  [7] "sunshine_duration"   "precip_h_mm"         "precip_type"        
## [10] "wind_highest_last_h" "wind_speed"          "cloud_cover"        
## [13] "Hour"                "Weekday"             "snow_cover_cm"      
## [16] "Holiday"             "School_holidays"     "wind_dir"           
## [19] "Month"
# Which interaction terms have high explanatory power, all alone?
dummy_results <- matrix(NA, ncol(data_final_aggr) - 1, ncol(data_final_aggr) - 1)

# Create variable combinations to loop through
colnames(dummy_results) <- colnames(data_final_aggr[-which(colnames(data_final_aggr) == "Bikes_Counted")])
rownames(dummy_results) <- colnames(data_final_aggr[-which(colnames(data_final_aggr) == "Bikes_Counted")])


for(var1 in rownames(dummy_results)) {
  
  for(var2 in colnames(dummy_results)) {
    
    if(var1 != var2) { # only for 2 different variables, exclude diagonal
      # Look at increase of R² w.r.t. model without interaction term but only both vars
      # This helps to evaluate how much additional power comes from the interaction term
      formula1 <- as.formula(paste("Bikes_Counted ~ ", var1, " + ", var2, sep = ""))
      formula2 <- as.formula(paste("Bikes_Counted ~ ", var1, " + ", var2, " + ", var1, ":", var2, sep = ""))
      
      rsq_1 <- summary(lm(data = data_final_aggr, formula = formula1))$r.squared
      rsq_2 <- summary(lm(data = data_final_aggr, formula = formula2))$r.squared
      
      dummy_results[var1, var2] <- rsq_2 - rsq_1
      
    } else {
      
      dummy_results[var1, var2] <- NA
      
    }
    
  }
  
}

# Inspect
View(as.data.frame(dummy_results))

heatmap(dummy_results, Rowv = NA, Colv = "Rowv")

# Store in format that is easier to use
res_flat <- as.vector(dummy_results)

# Get the names of the possible combinations
names(res_flat) <- apply(expand.grid(rownames(dummy_results), colnames(dummy_results)), 1, paste, collapse = ":") 

# Take only every other element -> pairs occur twice because the matrix is symmetric
# (this crude subsetting does not remove all duplicates, as visible in the ranking below)
res_flat <- res_flat[seq(1, length(res_flat), 2)]
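
A more robust way to keep each unordered pair exactly once would be to index the upper triangle of the matrix directly; a sketch:

```r
# Sketch: each unordered pair appears exactly once in the upper triangle
ut <- upper.tri(dummy_results)
res_unique <- dummy_results[ut]
names(res_unique) <- paste(rownames(dummy_results)[row(dummy_results)[ut]],
                           colnames(dummy_results)[col(dummy_results)[ut]],
                           sep = ":")
sort(res_unique, decreasing = TRUE) |> head(10)
```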

# Find 50 highest increases of R²
sort(res_flat, decreasing = TRUE)[1:50]
##                     Weekday:Hour        Weekday:sunshine_duration 
##                      0.088786223                      0.068515676 
##                 cloud_cover:Hour                    air_temp:Hour 
##                      0.066838099                      0.064427108 
##             Weekday:rel_humidity                 air_temp:Weekday 
##                      0.046916448                      0.036676170 
##                 Weekday:air_temp                   air_temp:Month 
##                      0.036676170                      0.030358755 
##     wind_dir:wind_highest_last_h     wind_highest_last_h:wind_dir 
##                      0.025438644                      0.025438644 
##  wind_highest_last_h:cloud_cover  cloud_cover:wind_highest_last_h 
##                      0.023276072                      0.023276072 
##              wind_dir:wind_speed            air_temp:rel_humidity 
##                      0.019849925                      0.017111432 
## air_pressure:wind_highest_last_h wind_highest_last_h:air_pressure 
##                      0.015783914                      0.015783914 
##   wind_highest_last_h:wind_speed                    Weekday:Month 
##                      0.015275988                      0.014967733 
##        wind_highest_last_h:Month                    wind_dir:Hour 
##                      0.013852503                      0.013808261 
##                cloud_cover:Month              cloud_cover:Weekday 
##                      0.012016789                      0.011852918 
##              Weekday:cloud_cover           cloud_cover:wind_speed 
##                      0.011852918                      0.011848038 
##       air_temp:sunshine_duration            air_pressure:air_temp 
##                      0.011825016                      0.011553103 
##            air_temp:air_pressure                air_pressure:Hour 
##                      0.011553103                      0.011351883 
##             cloud_cover:air_temp             air_temp:cloud_cover 
##                      0.010792805                      0.010792805 
##         cloud_cover:rel_humidity     wind_highest_last_h:air_temp 
##                      0.010113877                      0.009482674 
##     air_temp:wind_highest_last_h              air_temp:visibility 
##                      0.009482674                      0.008897519 
##                   wind_dir:Month                     Holiday:Hour 
##                      0.008825993                      0.008799381 
##          air_pressure:wind_speed         wind_highest_last_h:Hour 
##                      0.008365509                      0.008121863 
##                 precip_h_mm:Hour             air_temp:precip_type 
##                      0.007904556                      0.007693044 
##        Holiday:sunshine_duration  wind_highest_last_h:precip_type 
##                      0.007276480                      0.007199604 
##              Weekday:precip_type               air_pressure:Month 
##                      0.006594162                      0.006131581 
##                 air_temp:Holiday                 Holiday:air_temp 
##                      0.005639090                      0.005639090 
##               Weekday:visibility             Holiday:rel_humidity 
##                      0.005263972                      0.005135211 
##             wind_dir:cloud_cover             cloud_cover:wind_dir 
##                      0.004936417                      0.004936417
# Only use 20 best
twenty_best_interactions <- sort(res_flat, decreasing = TRUE)[1:20] |> names() |> 
  paste(collapse = " + ")

lm_exp$call
## lm(formula = Bikes_Counted ~ . - Date_Hour, data = data_final_aggr)
# Prepare formula
#paste(twenty_best_interactions, collapse = " + ")
lm_exp_plus_interaction_formula = paste("Bikes_Counted ~ . + ", twenty_best_interactions, " - Date_Hour", sep = "") |> as.formula()

# Train model
lm_exp_plus_interaction <- lm(data = data_final_aggr,
                              formula = lm_exp_plus_interaction_formula)

summary(lm_exp_plus_interaction)$r.squared - summary(lm_exp)$r.squared
## [1] 0.1845769
# Strong increase! -> roughly 18 percentage points with only 20 interaction terms

Further functional forms

In the following, it might be interesting to look at other functional forms - in our case, quadratics and cubics. This will be done by running three models of Bikes_Counted on each explanatory variable:

  1. the variable alone
  2. the variable plus its square
  3. the variable plus its square and its cube

Then, again, I will check whether this brought about substantial improvements w.r.t. the baseline.

# Other functional forms: squared?
single_var_sq_results <- matrix(NA, nrow = nrow(dummy_results), ncol = 2)
colnames(single_var_sq_results) <- c("Squares added: Diff. in R²", "Cubics added: Diff. in R²")
rownames(single_var_sq_results) <- rownames(dummy_results)

for(k in rownames(single_var_sq_results)) {
  
  if(is.numeric(data_final_aggr[[k]])){
  
  f1 <- as.formula(paste("Bikes_Counted ~ ", k, sep = ""))
  f2 <- as.formula(paste("Bikes_Counted ~ ", k, " + I(", k, "**2)", sep = ""))
  f3 <- as.formula(paste("Bikes_Counted ~ ", k, " + I(", k, "**2)", " + I(", k, "**3)", sep = ""))
  
  baseline_rsq <- summary(lm(data = data_final_aggr, formula = f1))$r.squared
  squares_rsq <- summary(lm(data = data_final_aggr, formula = f2))$r.squared
  cubics_rsq <- summary(lm(data = data_final_aggr, formula = f3))$r.squared
  
  # Increases in R² gained by adding squares resp. cubics
  single_var_sq_results[k, 1] <- squares_rsq - baseline_rsq
  single_var_sq_results[k, 2] <- cubics_rsq - baseline_rsq
  }
}

single_var_sq_results <- single_var_sq_results |> na.omit()


data_mat <- as.matrix(single_var_sq_results[order(single_var_sq_results[,1], decreasing = TRUE),])
rownames_vec <- rownames(data_mat)
cols <- c("blue", "red")

# Plot these improvements in R²
{
matplot(data_mat, pch = 19, col = cols, xaxt = "n",
        xlab = "Variable", ylab = "ΔR²", main = "R² - Increase through transformations")
axis(1, at = 1:nrow(data_mat), labels = rownames_vec, las = 2, cex.axis = 0.5)

legend("topright", legend = colnames(data_mat), col = cols, pch = 19)
}

# useful:
# wind_highest_last_h as **2, air_temp as **2 and **3, rel_humidity as **2, air_pressure as **2
added_functional_forms <- paste("I(wind_highest_last_h ** 2) + I(air_temp ** 3) + I(air_temp ** 2) + I(rel_humidity ** 2) + I(air_pressure ** 2)")

# Prepare formula as object, can be used later
lm_exp_plus_interaction_functforms_formula <- paste(
  paste(deparse(lm_exp_plus_interaction_formula), collapse = ""), " + ", 
  added_functional_forms, sep = "") |> as.formula()

lm_exp_plus_interaction_functforms <- lm(data = data_final_aggr,
                              formula = lm_exp_plus_interaction_functforms_formula)

summary(lm_exp_plus_interaction_functforms)$r.squared - summary(lm_exp)$r.squared
## [1] 0.1972999
# Compared to the interaction model, only a minimal further increase in fit... given it's 5 additional variables:
summary(lm_exp_plus_interaction_functforms)$r.squared - summary(lm_exp_plus_interaction)$r.squared
## [1] 0.01272301

Let’s quickly visualize the predictions:

predpower_comparison <- data.frame(
  Date_Time = data_final_aggr$Date_Hour,
  Actual = data_final_aggr$Bikes_Counted,
  `LM_Baseline` = predict(lm_exp, data_final_aggr),
  `LM_Interaction_terms` = predict(lm_exp_plus_interaction, data_final_aggr),
  `LM_Interaction_terms_FunctForms` = predict(lm_exp_plus_interaction_functforms, data_final_aggr)
  )

# We want no negative predictions! Clamp the prediction columns (all but Date_Time).
predpower_comparison[, -1][predpower_comparison[, -1] < 0] <- 0

prettypal <- c('Actual' = 'darkgrey', 'LM Baseline' = "#e41a1c", 'LM + Interact.' = "#377eb8", 'LM + Interact. + Fct. Forms' = "#4daf4a")



ggplot(data = predpower_comparison, aes(x = Date_Time)) + 
  geom_line(aes(y = Actual, color = "Actual")) + 
  geom_line(aes(y = LM_Baseline, color = 'LM Baseline')) +
  geom_line(aes(y = LM_Interaction_terms, color = 'LM + Interact.')) +
  geom_line(aes(y = LM_Interaction_terms_FunctForms, color = 'LM + Interact. + Fct. Forms')) +
  scale_color_manual(values = prettypal) + 
  ggtitle("Models vs. actual values")

It is certainly more informative to zoom into a few sampled periods:

n_sample_periods <- 20
length_sample_periods <- 100

for(i in 1:n_sample_periods) {
  subsample_startrow <- sample(1:(nrow(predpower_comparison) - length_sample_periods), 1)
  subsample_endrow <- subsample_startrow + length_sample_periods
  plotsample <- predpower_comparison[subsample_startrow:subsample_endrow,]
  
  plot <- ggplot(data = plotsample, aes(x = Date_Time)) + 
  geom_line(aes(y = Actual, color = "Actual")) + 
  geom_line(aes(y = LM_Baseline, color = 'LM Baseline')) +
  geom_line(aes(y = LM_Interaction_terms, color = 'LM + Interact.')) +
  geom_line(aes(y = LM_Interaction_terms_FunctForms, color = 'LM + Interact. + Fct. Forms')) +
  scale_color_manual(values = prettypal) +
  ggtitle("Example periods: Models vs. actual value")
  
  print(plot)
  
}

Is the partly very good explanatory power due to overfitting, or would it also hold on random testing data? We can be optimistic, because lm is not very prone to overfitting when \(n_{obs} \gg n_{vars}\). To check, I perform a hold-out cross-validation with an 80-20 training-testing split.

# Prepare function to get rid of negative predictions
no_negative_predictions <- function(array) {array[which(array < 0)] <- 0; return(array)}

# Prepare functions to get metrics quickly
calc_rsq <- function(actual, predicted) {
  SSR <- sum((actual - predicted) ** 2)
  SST <- (sum((actual - mean(actual)) ** 2))
  rsq <- 1 - SSR / SST
  return(rsq)
}

calc_mae <- function(actual, predicted) {
  mae <- mean(abs(actual - predicted))
  return(mae)
}
  
calc_rmse <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted) ** 2))  
  return(rmse)
} 

Cross-validation

Note on the abbreviations:

  • LM: baseline lm
  • LM + IA: LM plus the 20 most promising interaction terms
  • LM + IA + FF: LM + IA plus the 5 most promising functional forms

set.seed(0345)
iter_cv <- 5

cv_res_matrix <- matrix(NA, nrow = iter_cv, ncol = (3 * 2 * 3)) # 3 models - 2 splits - 3 metrics

models <- c("LM", "LM + IA", "LM + IA + FF")
splits <- c("Train", "Test")
metrics <- c("RMSE", "MAE", "R2")

colnames(cv_res_matrix) <- paste(rep(models, each = 6), 
                                 rep(rep(splits, each = 3), 3),
                                 rep(metrics, 6),
                                 sep = ", ")

# Store cross-validation RMSE
for(j in 1:iter_cv) {
  
  # Hold-out CV: always get 80% for training and 20% for testing
  train_sample <- sample(1:nrow(data_final_aggr), floor(0.8 * nrow(data_final_aggr)))
  
  data_train <- data_final_aggr[train_sample,]
  data_test <- data_final_aggr[- train_sample,]
  
  # Train data and find performance on training set
  lm0 <- lm(formula = Bikes_Counted ~ . - Date_Hour, data = data_train)
  lm1 <- lm(formula = lm_exp_plus_interaction_formula, data = data_train)
  lm2 <- lm(formula = lm_exp_plus_interaction_functforms_formula, data = data_train)
  
  train_pred_0 <- predict(lm0, data_train) |> no_negative_predictions()
  train_pred_1 <- predict(lm1, data_train) |> no_negative_predictions()
  train_pred_2 <- predict(lm2, data_train) |> no_negative_predictions()
  
  # Predict on testing set as well
  test_pred_0 <- predict(lm0, data_test) |> no_negative_predictions()
  test_pred_1 <- predict(lm1, data_test) |> no_negative_predictions()
  test_pred_2 <- predict(lm2, data_test) |> no_negative_predictions()

  
  cv_res_matrix[j, "LM, Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_0)
  cv_res_matrix[j, "LM, Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_0)
  cv_res_matrix[j, "LM, Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_0)
  cv_res_matrix[j, "LM, Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_0)
  cv_res_matrix[j, "LM, Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_0)
  cv_res_matrix[j, "LM, Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_0)
  
  cv_res_matrix[j, "LM + IA, Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_1)
  cv_res_matrix[j, "LM + IA, Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_1)
  cv_res_matrix[j, "LM + IA, Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_1)
  cv_res_matrix[j, "LM + IA, Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_1)
  cv_res_matrix[j, "LM + IA, Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_1)
  cv_res_matrix[j, "LM + IA, Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_1)
  
  cv_res_matrix[j, "LM + IA + FF, Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_2)
  cv_res_matrix[j, "LM + IA + FF, Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_2)
  cv_res_matrix[j, "LM + IA + FF, Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_2)
  cv_res_matrix[j, "LM + IA + FF, Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_2)
  cv_res_matrix[j, "LM + IA + FF, Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_2)
  cv_res_matrix[j, "LM + IA + FF, Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_2)
  
  
}

print(cv_res_matrix |> as.data.frame())
##   LM, Train, RMSE LM, Train, MAE LM, Train, R2 LM, Test, RMSE LM, Test, MAE
## 1        71.84773       38.13420     0.5794026       69.52141      37.47217
## 2        71.94497       38.05296     0.5779706       68.89111      37.40996
## 3        71.86519       37.94488     0.5772654       69.40460      37.70370
## 4        72.01739       38.01281     0.5766807       68.81103      37.61346
## 5        71.57650       38.03429     0.5781974       70.53839      37.11994
##   LM, Test, R2 LM + IA, Train, RMSE LM + IA, Train, MAE LM + IA, Train, R2
## 1    0.5780675             54.64346            27.71116          0.7567142
## 2    0.5869450             54.92215            27.76031          0.7540556
## 3    0.5875923             55.04246            27.75410          0.7520144
## 4    0.5897026             55.14801            27.76916          0.7517707
## 5    0.5841033             54.75681            27.72497          0.7531434
##   LM + IA, Test, RMSE LM + IA, Test, MAE LM + IA, Test, R2
## 1            54.38837           27.77833         0.7417633
## 2            53.18438           27.62686         0.7538217
## 3            52.60181           27.58089         0.7631073
## 4            52.31573           27.56322         0.7628369
## 5            53.85173           27.28904         0.7575994
##   LM + IA + FF, Train, RMSE LM + IA + FF, Train, MAE LM + IA + FF, Train, R2
## 1                  53.12346                 27.00214               0.7700608
## 2                  53.38092                 27.04676               0.7676654
## 3                  53.59656                 27.07764               0.7648719
## 4                  53.59619                 27.03846               0.7655441
## 5                  53.23952                 27.03315               0.7666345
##   LM + IA + FF, Test, RMSE LM + IA + FF, Test, MAE LM + IA + FF, Test, R2
## 1                 53.02584                27.09378              0.7545399
## 2                 51.86227                27.08503              0.7659091
## 3                 50.93540                26.76405              0.7778789
## 4                 51.10568                26.95470              0.7736810
## 5                 52.46829                26.61718              0.7698938
# cv_res_matrix |> as.data.frame() |> View()

mean_of_all_iters <- apply(cv_res_matrix, 2, mean)

Additional: Tweaking the lm

Note: The following section shows some alterations of the linear model (time lags, aggregations of the previous day's weather). However, the most relevant models are the ones that were just run; these will also be the ones compared to the more advanced models.

The actual performance of the models is arguably better than reported, because we can physically rule out negative bike traffic. Therefore, I compute a measure that uses predictions with negative values replaced by 0:

# Calculate new R-sq's with no negative predictions -> should have even higher accuracy.

rsq_nonegative <- function(model) {
  
  actual_series <- model$model$Bikes_Counted
  fitted_values_series <- model$fitted.values
  
  fitted_values_series[fitted_values_series < 0] <- 0
  
  RSS <- sum((actual_series - fitted_values_series) ** 2)
  TSS <- sum((actual_series - mean(actual_series)) ** 2)
  
  rsq_nonegativepreds <- 1 - RSS / TSS
  
  return(rsq_nonegativepreds)
  
}

# Note: Here, I check models which saw the entire dataset 

lm_exp |> rsq_nonegative()
## [1] 0.5793955
summary(lm_exp)$r.squared
## [1] 0.5599292
lm_exp_plus_interaction |> rsq_nonegative()
## [1] 0.7546476
summary(lm_exp_plus_interaction)$r.squared
## [1] 0.7445061
lm_exp_plus_interaction_functforms |> rsq_nonegative()
## [1] 0.7678233
summary(lm_exp_plus_interaction_functforms)$r.squared
## [1] 0.7572291

Using time lags

Now we run an additional model that takes into account all explanatory variables plus a selected number of time lags.

# Define own function to repeat for different models and number of timelags
lm_with_timelags <- function(dta, formula, timelags_to_add) {
  
  timeseries <- dta$Bikes_Counted
  
  for(nlag in 1:timelags_to_add) {
    
    name <- paste("Bikes_Counted_L", nlag, sep = "")
    
    raw_vec <- rep(NA, nrow(dta))
    raw_vec[(nlag + 1)  : (length(raw_vec))] <- 
      timeseries[1:(length(raw_vec) - nlag)]
    
    dta[[name]] <- raw_vec
    
  }
  
  dta <- dta |> na.omit()
  formula <- as.formula(formula)
  
  lm_final <- lm(formula = formula, data = dta)
  
  return(lm_final)
  
}

lm_exp_tl_1 <- lm_with_timelags(dta = data_final_aggr,
                                formula = Bikes_Counted ~ . - Date_Hour,
                                timelags_to_add = 1)

lm_exp_tl_6 <- lm_with_timelags(dta = data_final_aggr,
                                formula = Bikes_Counted ~ . - Date_Hour,
                                timelags_to_add = 6)



lm_exp_plus_interaction_tl_1 <- lm_with_timelags(dta = data_final_aggr,
                                formula = lm_exp_plus_interaction_formula,
                                timelags_to_add = 1)

lm_exp_plus_interaction_tl_6 <- lm_with_timelags(dta = data_final_aggr,
                                formula = lm_exp_plus_interaction_formula,
                                timelags_to_add = 6)



lm_exp_plus_interaction_functforms_tl_1 <- lm_with_timelags(dta = data_final_aggr,
                                formula = lm_exp_plus_interaction_functforms_formula,
                                timelags_to_add = 1)

lm_exp_plus_interaction_functforms_tl_6 <- lm_with_timelags(dta = data_final_aggr,
                                formula = lm_exp_plus_interaction_functforms_formula,
                                timelags_to_add = 6)

# Make use of own pre-defined function for R² without negative predictions
lm_exp_tl_1 |> rsq_nonegative()
## [1] 0.9206689
lm_exp_tl_6 |> rsq_nonegative()
## [1] 0.9452319
lm_exp_plus_interaction_tl_1 |> rsq_nonegative()
## [1] 0.944513
lm_exp_plus_interaction_tl_6 |> rsq_nonegative()
## [1] 0.955461
lm_exp_plus_interaction_functforms_tl_1 |> rsq_nonegative()
## [1] 0.9446086
lm_exp_plus_interaction_functforms_tl_6 |> rsq_nonegative()
## [1] 0.9556802
# Predictions of lm_exp_tl_6 - only roughly 1% worse with R² w.r.t. the model with 
# all interaction terms, although far lower complexity
# Prediction frame for visualization
lm_exp_tl_6_preds <- data.frame(Date_Time = lm_exp_tl_6$model$Date_Hour,
                                Actual = lm_exp_tl_6$model$Bikes_Counted,
                                Predicted = lm_exp_tl_6$fitted.values)
lm_exp_tl_6_preds$Predicted[lm_exp_tl_6_preds$Predicted < 0] <- 0

By far the strongest gain in explanatory power comes from the time lags. This means that if we were to forecast the next hour, performance would be very good in most cases.
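The mechanics can be illustrated on a small synthetic series (a hedged, self-contained sketch; the toy data and the single-lag setup are illustrative assumptions, not the project data). A lag column is simply the series shifted by one hour, exactly as lm_with_timelags() constructs it above:

```r
# Toy sketch: build a one-hour lag feature and use it for a
# one-step-ahead forecast with a plain lm.
set.seed(1)
hours <- 1:200
bikes <- 100 + 50 * sin(2 * pi * hours / 24) + rnorm(200, sd = 5)

toy <- data.frame(Bikes_Counted = bikes)
toy$Bikes_Counted_L1 <- c(NA, bikes[-length(bikes)])  # value of the previous hour
toy <- na.omit(toy)

fit <- lm(Bikes_Counted ~ Bikes_Counted_L1, data = toy)

# Forecast the next, not yet observed hour from the last observed value
next_hour <- predict(fit, newdata = data.frame(Bikes_Counted_L1 = bikes[length(bikes)]))
```

Because of the strong hour-to-hour persistence, even this single lag explains most of the variance, which mirrors the jump in R² observed above.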

Again, I sample some parts of the predicted data and compare it with the actual values:

n_sample_periods <- 20
length_sample_periods <- 100

for(i in 1:n_sample_periods) {
  subsample_startrow <- sample(1:(nrow(lm_exp_tl_6_preds) - length_sample_periods), 1)
  subsample_endrow <- subsample_startrow + length_sample_periods
  plotsample <- lm_exp_tl_6_preds[subsample_startrow:subsample_endrow,]
  
  plot <- ggplot(data = plotsample, aes(x = Date_Time)) + 
  geom_line(aes(y = Actual, color = "Actual")) + 
  geom_line(aes(y = Predicted, color = 'Simple LM + 6 Timelags')) +
  scale_color_manual(values = c("Actual" = prettypal[[2]], 
                                "Simple LM + 6 Timelags" = prettypal[[4]])) +
  ggtitle("Model with timelags: Prediction accuracy")
  
  print(plot)
  
}

There is a great fit for the very basic lm when 6 lags (= 6 previous hours) are added.

Prediction for next day

…with aggregated meteorological data of day before.

If there is a potential practical application for such a project, e. g. for traffic management or planning in the restaurant business, then users would likely want to know at the beginning of the day what the traffic will look like across the whole day.

For that purpose, I will aggregate means/sums/maximums of meteorological data from the day before, and ex-ante known criteria such as the weekday, the month, and the hour.

# Need mean (formula first, then data, as aggregate() expects)
daily_means <- aggregate(cbind(visibility, air_pressure, rel_humidity,
                               air_temp, wind_speed, cloud_cover, snow_cover_cm) ~
                           substr(Date_Hour, 1, 10),
                         data = data_final_aggr,
                         FUN = mean)

# Need sum
daily_sum <- aggregate(cbind(Bikes_Counted, sunshine_duration,
                             precip_h_mm) ~
                         substr(Date_Hour, 1, 10),
                       data = data_final_aggr,
                       FUN = sum)


# Need max
daily_max <- aggregate(wind_highest_last_h ~
                         substr(Date_Hour, 1, 10),
                       data = data_final_aggr,
                       FUN = max)


colnames(daily_max)[1] <- "Date"
colnames(daily_means)[1] <- "Date"
colnames(daily_sum)[1] <- "Date"

# Give meaningful prefixes
colnames(daily_max)[2:ncol(daily_max)] <- paste0("max_", colnames(daily_max)[2:ncol(daily_max)])
colnames(daily_means)[2:ncol(daily_means)] <- paste0("mean_", colnames(daily_means)[2:ncol(daily_means)])
colnames(daily_sum)[2:ncol(daily_sum)] <- paste0("sum_", colnames(daily_sum)[2:ncol(daily_sum)])



# Variables whose values are known ex-ante (plus the target):
data_nolag <- data_final_aggr[, c("Date_Hour", "Hour", "Weekday", "Holiday",
                                  "School_holidays", "Month", "Bikes_Counted")]
# Bring the different metrics together
data_daily <- merge(daily_sum, merge(daily_max, daily_means, by = "Date"), by = "Date")

data_daily$Date <- data_daily$Date |> as.Date()
# Get the day for which the prediction is useful: The next day.
# This will be the day where I merge it
data_daily$onedaylater <- data_daily$Date + 1

# Derive the day from the hourly timestamp
data_nolag$Date <- substr(data_nolag$Date_Hour, 1, 10) |> as.Date()

# 1db = one day before. Merging brings together ex-ante known facts about the day plus
# the meteorological aggregated data of the day before
data_with_vals_1db <- merge(data_nolag, data_daily, by.x = "Date", by.y = "onedaylater",
                            all.x = TRUE)
data_with_vals_1db <- data_with_vals_1db[order(data_with_vals_1db$Date_Hour),]

# Note: data from pre-day used from 1 AM on
data_with_vals_1db <- data_with_vals_1db[data_with_vals_1db$Date_Hour >= "2016-01-27",]

lm(data_with_vals_1db, formula = Bikes_Counted ~ . - Date - Date_Hour - `Date.y`) |>
  summary()
## 
## Call:
## lm(formula = Bikes_Counted ~ . - Date - Date_Hour - Date.y, data = data_with_vals_1db)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -253.92  -42.46   -6.85   28.62 1142.53 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -9.711e+02  5.011e+01 -19.382  < 2e-16 ***
## Hour2                    4.742e-01  2.389e+00   0.199 0.842652    
## Hour3                    3.102e+00  2.389e+00   1.298 0.194201    
## Hour4                    1.330e+01  2.390e+00   5.565 2.64e-08 ***
## Hour5                    3.144e+01  2.391e+00  13.150  < 2e-16 ***
## Hour6                    4.135e+01  2.391e+00  17.295  < 2e-16 ***
## Hour7                    4.828e+01  2.391e+00  20.198  < 2e-16 ***
## Hour8                    7.275e+01  2.390e+00  30.444  < 2e-16 ***
## Hour9                    1.072e+02  2.389e+00  44.868  < 2e-16 ***
## Hour10                   1.285e+02  2.389e+00  53.808  < 2e-16 ***
## Hour11                   1.401e+02  2.389e+00  58.638  < 2e-16 ***
## Hour12                   1.556e+02  2.389e+00  65.127  < 2e-16 ***
## Hour13                   1.641e+02  2.389e+00  68.677  < 2e-16 ***
## Hour14                   1.606e+02  2.389e+00  67.239  < 2e-16 ***
## Hour15                   1.455e+02  2.390e+00  60.876  < 2e-16 ***
## Hour16                   1.129e+02  2.390e+00  47.237  < 2e-16 ***
## Hour17                   7.551e+01  2.390e+00  31.599  < 2e-16 ***
## Hour18                   4.184e+01  2.389e+00  17.513  < 2e-16 ***
## Hour19                   1.954e+01  2.389e+00   8.177 2.97e-16 ***
## Hour20                   7.936e+00  2.390e+00   3.321 0.000897 ***
## Hour21                   3.542e+00  2.387e+00   1.484 0.137786    
## Hour22                   1.994e+00  2.387e+00   0.835 0.403734    
## Hour23                   9.400e-01  2.389e+00   0.393 0.693956    
## WeekdayMon              -1.347e+01  1.444e+00  -9.328  < 2e-16 ***
## WeekdaySat               2.323e+01  1.319e+00  17.610  < 2e-16 ***
## WeekdaySun               5.612e+01  1.333e+00  42.115  < 2e-16 ***
## WeekdayThu               3.139e+00  1.319e+00   2.379 0.017346 *  
## WeekdayTue               7.890e-01  1.325e+00   0.595 0.551533    
## WeekdayWed               5.806e+00  1.322e+00   4.392 1.12e-05 ***
## Holiday1                 5.625e+01  3.100e+00  18.148  < 2e-16 ***
## School_holidays1        -1.004e+00  1.045e+00  -0.961 0.336539    
## Month02                  1.012e+01  2.004e+00   5.049 4.47e-07 ***
## Month03                  2.890e+01  2.036e+00  14.194  < 2e-16 ***
## Month04                  4.202e+01  2.115e+00  19.870  < 2e-16 ***
## Month05                  4.527e+01  2.229e+00  20.311  < 2e-16 ***
## Month06                  3.814e+01  2.481e+00  15.373  < 2e-16 ***
## Month07                  4.047e+01  2.549e+00  15.877  < 2e-16 ***
## Month08                  3.785e+01  2.551e+00  14.838  < 2e-16 ***
## Month09                  3.069e+01  2.330e+00  13.173  < 2e-16 ***
## Month10                  1.729e+01  2.074e+00   8.335  < 2e-16 ***
## Month11                  9.490e+00  1.951e+00   4.863 1.16e-06 ***
## Month12                  3.003e+00  1.938e+00   1.550 0.121264    
## sum_Bikes_Counted        1.112e-02  4.608e-04  24.132  < 2e-16 ***
## sum_sunshine_duration   -2.121e-02  3.307e-03  -6.412 1.45e-10 ***
## sum_precip_h_mm          2.292e-01  1.041e-01   2.201 0.027723 *  
## max_wind_highest_last_h -5.564e-01  1.809e-01  -3.075 0.002106 ** 
## mean_visibility          1.393e-01  3.771e-02   3.694 0.000221 ***
## mean_air_pressure        9.077e-01  4.775e-02  19.010  < 2e-16 ***
## mean_rel_humidity        2.599e-01  5.788e-02   4.490 7.15e-06 ***
## mean_air_temp            1.864e+00  1.198e-01  15.559  < 2e-16 ***
## mean_wind_speed         -2.615e-01  4.200e-01  -0.623 0.533503    
## mean_cloud_cover        -4.764e+00  3.280e-01 -14.523  < 2e-16 ***
## mean_snow_cover_cm       5.535e-01  7.002e-01   0.791 0.429194    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82.12 on 54250 degrees of freedom
##   (192 observations deleted due to missingness)
## Multiple R-squared:  0.4448, Adjusted R-squared:  0.4443 
## F-statistic: 835.8 on 52 and 54250 DF,  p-value: < 2.2e-16

The fit is only moderate w. r. t. previously fitted models. However, what happens if we account for interaction terms?

Again, I check for the increase in the model fit in case I include the interaction term additionally to the two single variables.

# Improve performance
checkgrid <- t(combn(colnames(data_with_vals_1db), 2)) |> as.data.frame()
colnames(checkgrid) <- c("var_a", "var_b")
checkgrid$r_sq_increase <- NA
# Don't need to check explained variable
checkgrid <- checkgrid[-which(checkgrid$var_a == "Bikes_Counted" |
                              checkgrid$var_b == "Bikes_Counted"),]
# Don't want to use date variables
checkgrid <- checkgrid[-which(grepl("Date", checkgrid$var_a) |
                              grepl("Date", checkgrid$var_b)),]


for(i in 1:nrow(checkgrid)) {
  var1 <- checkgrid$var_a[i]
  var2 <- checkgrid$var_b[i]
  
  f1 <- paste0("Bikes_Counted ~ ", var1, "+ ", var2) |> as.formula()
  f2 <- paste0("Bikes_Counted ~ ", var1, "+ ", var2, " + ", var1, ":", var2) |> as.formula()
    
  m1 <- lm(data_with_vals_1db, formula = f1)
  m2 <- lm(data_with_vals_1db, formula = f2)
  
  r_sq_increase <- summary(m2)$r.squared - summary(m1)$r.squared
    
  checkgrid$r_sq_increase[i] <- r_sq_increase
  
}


# Keep only interactions which increase R² by more than 0.01 (one percentage point)
# Inspect interactively if desired:
# checkgrid[order(checkgrid$r_sq_increase, decreasing = TRUE),] |> View()
add_interactions_combis <- checkgrid[which(checkgrid$r_sq_increase > 0.01),]

added_interactions <- paste(add_interactions_combis$var_a, add_interactions_combis$var_b, sep = ":") |>
  paste(collapse = " + ")

# Train model only on these interactions
lm_dailylag_interactions <- lm(data_with_vals_1db,
                               formula = as.formula(
                                 paste("Bikes_Counted ~ . - Date - Date_Hour - `Date.y`", 
                                       added_interactions, sep = " + ")))

lm_dailylag_interactions |> rsq_nonegative()
## [1] 0.6640124
lm_exp_plus_interaction |> rsq_nonegative()
## [1] 0.7546476

If, instead of using hourly weather data to predict bike traffic for the very same hour, we rely only on the means/maxima/sums of the previous day plus ex-ante known features of each point in time – e. g. holiday, weekday, … – then the fit is less than 10% worse than in the model with hourly weather data. This is quite impressive, but no big surprise on reflection: a large share of the variation is already explained by seasonality and the time of day.

Common metrics

  • R². NOTE: the reported R² is the one that sets negative predictions to zero, computed with the rsq_nonegative() function defined above.
  • RMSE. Root mean squared error.
  • MAE. Mean absolute error.
# FUNCTION PREPARATION

## R²
# rsq_nonegative()

## RMSE
get_rmse <- function(model) {
  rmse <- sqrt(mean((model$residuals) ** 2))
  return(rmse)
}

## MAE
get_mae <- function(model) {
  mae <- mean(abs(model$residuals))
  return(mae)
}
model_list <- c(
  "lm_exp",
  "lm_exp_plus_interaction",
  "lm_exp_plus_interaction_functforms",
  "lm_exp_tl_1",
  "lm_exp_tl_6",
  "lm_exp_plus_interaction_tl_1",
  "lm_exp_plus_interaction_tl_6",
  "lm_exp_plus_interaction_functforms_tl_1",
  "lm_exp_plus_interaction_functforms_tl_6",
  "lm_dailylag_interactions"
)

metrics_df <- data.frame()

for(m in model_list) {
  
  name <- m
  model <- get(m)
  
  thisMAE <- model |> get_mae()
  thisRMSE <- model |> get_rmse()
  thisR2 <- model |> rsq_nonegative()
  
  entry <- data.frame(Model = name, MAE = thisMAE, RMSE = thisRMSE, R2 = thisR2)
  metrics_df <- rbind(metrics_df, entry)
  
}

The time-lag models and the day-before model are not bad at all, but for further comparison the three relevant models will be the baseline model and its two variations with added interaction terms and functional forms.

Note that the Cross-Validation results for those were already calculated:

cv_res_inspection <- cv_res_matrix |> apply(2, mean) |> matrix(nrow = 3) |> t() # get mean of all iterations

# Column order follows the construction of cv_res_matrix: RMSE, MAE, R2
colnames(cv_res_inspection) <- c("RMSE", "MAE", "R2")
rownames(cv_res_inspection) <- paste(rep(c("LM", "LM + IA", "LM + IA + FF"), each = 2),
                                     rep(c("Train", "Test"), 3), sep = ": ")
print(cv_res_inspection)
##                         RMSE      MAE        R2
## LM: Train           71.85036 38.03583 0.5779033
## LM: Test            69.43331 37.46385 0.5852821
## LM + IA: Train      54.90258 27.74394 0.7535397
## LM + IA: Test       53.26840 27.56767 0.7558257
## LM + IA + FF: Train 53.38733 27.03963 0.7669553
## LM + IA + FF: Test  51.87950 26.90295 0.7683806
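The reshaping above relies on the exact column order in which cv_res_matrix was built: model-major, then Train before Test, then the metrics in construction order. A minimal standalone check of the nested rep() expansion (self-contained toy code mirroring the earlier construction):

```r
# Reconstruct the column-name expansion used for cv_res_matrix
models  <- c("LM", "LM + IA", "LM + IA + FF")
splits  <- c("Train", "Test")
metrics <- c("RMSE", "MAE", "R2")

cols <- paste(rep(models, each = 6),           # 6 columns per model
              rep(rep(splits, each = 3), 3),   # 3 metrics per split
              rep(metrics, 6),                 # metric cycles fastest
              sep = ", ")

head(cols, 4)
# "LM, Train, RMSE" "LM, Train, MAE" "LM, Train, R2" "LM, Test, RMSE"
```

Since the metric cycles fastest, matrix(nrow = 3) |> t() yields one row per model-split pair with columns in the order RMSE, MAE, R2.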

Given the results of the hold-out cross-validation (5 iterations, 80-20 train-test split), I conclude that the three relevant variations of the linear model are not prone to overfitting.

Prepare the necessary information for comparison with later models:

metrics_eriks_models <- metrics_df[
  which(metrics_df$Model %in% c("lm_exp", "lm_exp_plus_interaction",
                                "lm_exp_plus_interaction_functforms")), ]

# metrics_eriks_models$MSE <- metrics_eriks_models$RMSE^2 # not used for final model comparison

newmodelnames <- c("lm_exp" = "LM Baseline",
                   "lm_exp_plus_interaction" = "LM + IA",
                   "lm_exp_plus_interaction_functforms" = "LM + IA + FF")
metrics_eriks_models$Model <- newmodelnames[metrics_eriks_models$Model]

print(metrics_eriks_models)
##          Model      MAE     RMSE        R2
## 1  LM Baseline 43.34079 73.00132 0.5793955
## 2      LM + IA 30.64337 55.62367 0.7546476
## 3 LM + IA + FF 30.02102 54.22102 0.7678233

Tree Model

library(rpart)

# Initial model
model <- rpart(Bikes_Counted ~ . - Date_Hour, data = train, method = "anova", cp = 0, maxdepth = 5)
# maxdepth = 5 keeps training and CV loss close without inflating either of them.
# It also makes the cp plot visually readable.

# 10-fold cross-validation default in rpart
plotcp(model) 

best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
best_cp
## [1] 0
# Prune the tree
pruned_model <- prune(model, cp = best_cp)

# Predict on test set
pred <- predict(pruned_model, newdata = test)

# True values
actual <- test$Bikes_Counted

# Metrics
mae <- mean(abs(pred - actual))
rmse <- sqrt(mean((pred - actual)^2))
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)

# Output metrics
metrics_tree <- c(MAE = mae, RMSE = rmse, R2 = r2)

# Training MSE
train_pred <- predict(pruned_model, newdata = train)
train_actual <- train$Bikes_Counted
training_mse <- mean((train_actual - train_pred)^2)

# CV MSE: xerror in the cptable is relative to the root-node error,
# so multiply by the MSE of the intercept-only model
cv_mse <- min(model$cptable[, "xerror"]) * mean((train_actual - mean(train_actual))^2)

# Output MSE losses
losses_tree <- c(Training_MSE = training_mse, CV_MSE = cv_mse)
losses_tree
## Training_MSE       CV_MSE 
##     3033.884     3951.999

Model insight: 10-fold cross-validation is the default in the rpart package, and the anova method uses MSE as the CV loss by default. An unconstrained tree would overfit (CV error >> training error); limiting the tree depth to 5 levels kept the CV error at the same level while shrinking the gap to the training error, resulting in a model that does not overfit.

Limiting the tree depth also makes the pruning parameter best_cp more stable and the cp plot readable. Without the depth limit, the plot was cluttered by far too many candidate cp values.
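The xerror scaling can be demonstrated in isolation (a hedged, self-contained sketch using the recommended rpart package and the built-in mtcars data, not the project data): values in the cptable are relative to the root-node error, so the absolute CV MSE is recovered by multiplying with the MSE of the intercept-only model.

```r
library(rpart)

# Small depth-limited tree, analogous to the setup above
set.seed(1)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(cp = 0, maxdepth = 3))

root_mse <- mean((mtcars$mpg - mean(mtcars$mpg))^2)   # error of the root node
best_row <- which.min(fit$cptable[, "xerror"])
best_cp  <- fit$cptable[best_row, "CP"]               # cp value for pruning
cv_mse   <- fit$cptable[best_row, "xerror"] * root_mse # absolute CV MSE
```

The same rescaling is what turns the relative xerror into the CV_MSE reported for the tree model above.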

Ensemble

  • Repeat the analysis using two ensemble methods.
  • Investigate the hyperparameter settings of your ensembles with regards to your evaluation metric.
  • Report both the training and the CV loss.
  • Select the best configurations of your ensemble models based on the same evaluation metric as before.

Ensemble 1: Random Forest

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Remove Date_Hour
train_data <- train[, !(names(train) %in% c("Date_Hour"))]

# Prepare predictors and target
x <- train_data[, -which(names(train_data) == "Bikes_Counted")]
y <- train_data$Bikes_Counted

# Tune mtry with OOB error
best_mtry <- tuneRF(
  x = x,
  y = y,
  mtryStart = 5,
  stepFactor = 1.5,
  improve = 0.05,
  ntreeTry = 100,
  trace = TRUE,
  plot = TRUE
)
## mtry = 5  OOB error = 2803.116 
## Searching left ...
## mtry = 4     OOB error = 2950.968 
## -0.05274581 0.05 
## Searching right ...
## mtry = 7     OOB error = 2682.917 
## 0.04288022 0.05

optimal_mtry <- best_mtry[which.min(best_mtry[, "OOBError"]), "mtry"]

# Set up 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Train model with cross-validation using optimal mtry
set.seed(42)
model_cv <- caret::train(
  Bikes_Counted ~ .,
  data = train_data,
  method = "rf", # randomForest via caret, which handles the cross-validation
  trControl = ctrl,
  tuneGrid = data.frame(mtry = optimal_mtry),
  ntree = 500,
  nodesize = 100,
  maxnodes = 10
)

# Extract CV metrics
cv_mse <- model_cv$results$RMSE^2

# Final model from caret (trained on full train data)
model <- model_cv$finalModel

# Training predictions and MSE
train_pred <- predict(model_cv, newdata = train_data)
training_mse <- mean((train_data$Bikes_Counted - train_pred)^2)

# Prepare test data (remove Date_Hour, Bikes_Counted)
test_data <- test[, !(names(test) %in% c("Date_Hour", "Bikes_Counted"))]

# Test predictions and metrics
pred <- predict(model_cv, newdata = test_data)
actual <- test$Bikes_Counted

mae <- mean(abs(pred - actual))
rmse <- sqrt(mean((pred - actual)^2))
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)

metrics_ens1 <- c(MAE = mae, RMSE = rmse, R2 = r2)
losses_rf <- c(Training_MSE = training_mse, CV_MSE = cv_mse)

losses_rf
## Training_MSE       CV_MSE 
##     7117.411     7140.580

Model insight: tuneRF gives different results when only a small number of trees is used. We noticed that the higher the mtry parameter, the better the model performed; this was a clue to use more trees in tuneRF, which led to a higher and better mtry value. Node size and maximum number of nodes were tweaked to reduce the gap between CV error and training error. In the previous tree model the tree depth could be controlled directly, which helped against overfitting; however, this hyperparameter cannot be set in this particular package.

randomForest does not have a built-in cross-validation feature, but caret can add this on top with intuitive notation and the same hyperparameter names as the randomForest package.

Ensemble 2: Gradient Boosting

Note: the gbm package can parallelize the cross-validation folds across CPU cores, so we can afford heavier hyperparameter settings.

# install.packages("gbm")
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.3
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
train_data <- train[, !(names(train) %in% c("Date_Hour"))]
test_data <- test[, !(names(test) %in% c("Date_Hour", "Bikes_Counted"))]

set.seed(123)
model_gbm <- gbm(
  Bikes_Counted ~ .,
  data = train_data,
  distribution = "gaussian",
  n.trees = 400, # See the loss plot at the end: after 400 trees there is little marginal progress
  interaction.depth = 2, 
  shrinkage = 0.05,
  cv.folds = 20
)

best_iter <- gbm.perf(model_gbm, method = "cv", plot.it = FALSE) # Plot is made later manually with legend

pred <- predict(model_gbm, newdata = test_data, n.trees = best_iter)
actual <- test$Bikes_Counted

mae <- mean(abs(pred - actual))
rmse <- sqrt(mean((pred - actual)^2))
r2 <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)

metrics_ens2 <- c(MAE = mae, RMSE = rmse, R2 = r2)


# Training loss (MSE) at each iteration
train_loss <- model_gbm$train.error

# Validation loss (CV MSE) at each iteration
val_loss <- model_gbm$cv.error

# Plot both losses
plot(train_loss, type = "l", col = "grey", lwd = 2, ylab = "MSE", xlab = "Number of Trees")
lines(val_loss, col = "orange", lwd = 2)
legend("topright", legend = c("Training MSE", "CV MSE"), col = c("grey", "orange"), lwd = 2)

Model insight:

Many hyperparameters were tweaked. The most influential was the interaction depth: keeping it low reduces the complexity of the model, making training more stable. The plot shows that after 400 trees there is no further gain in CV error, while the model starts to overfit.
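
The mechanics behind gbm can be sketched in a few lines of base R (a toy sketch under our own naming, not gbm's implementation): each round fits a weak learner — here a depth-1 stump — to the current residuals and adds a shrunken fraction of its prediction.

```r
# Toy gradient boosting for squared loss with depth-1 stumps (illustrative only)
set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)

fit_stump <- function(x, r) {
  # best single split minimizing the SSE of the residuals r
  best <- NULL
  best_sse <- Inf
  for (s in quantile(x, probs = seq(0.05, 0.95, by = 0.05))) {
    left <- x <= s
    m_left <- mean(r[left]); m_right <- mean(r[!left])
    sse <- sum((r[left] - m_left)^2) + sum((r[!left] - m_right)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = m_left, right = m_right)
    }
  }
  best
}
predict_stump <- function(st, x) ifelse(x <= st$split, st$left, st$right)

shrinkage <- 0.05
n_trees <- 200
pred <- rep(mean(y), length(y))   # start from the mean, as gbm does
mse_path <- numeric(n_trees)
for (b in seq_len(n_trees)) {
  r <- y - pred                   # residuals = negative gradient of the MSE
  st <- fit_stump(x, r)
  pred <- pred + shrinkage * predict_stump(st, x)
  mse_path[b] <- mean((y - pred)^2)
}
```

interaction.depth generalizes the stump to deeper trees, and a smaller shrinkage typically requires more trees — the same trade-off tuned above.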

Neural Network

  • Repeat the analysis with a neural network.
  • Investigate three different configurations with regards to your evaluation metric and select the best configuration. Use the same evaluation metric as before.
  • Report both the training and the CV loss.
library(keras)
## Warning: package 'keras' was built under R version 4.4.3
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(recipes)
## Warning: package 'recipes' was built under R version 4.4.3
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
## 
##     step
library(tensorflow)
## Warning: package 'tensorflow' was built under R version 4.4.3
## 
## Attaching package: 'tensorflow'
## The following object is masked from 'package:caret':
## 
##     train
library(reticulate)
## Warning: package 'reticulate' was built under R version 4.4.3
# Remove unused columns
train_data <- train %>% select(-Date_Hour)
test_data <- test %>% select(-Date_Hour, -Bikes_Counted)

# Define preprocessing using `recipes`
rec <- recipe(Bikes_Counted ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%      # one-hot encode categoricals
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  prep()
## Warning: !  The following column has zero variance so scaling cannot be used:
##   precip_type_X2.
## ℹ Consider using ?step_zv (`?recipes::step_zv()`) to remove those columns
##   before normalizing.
# Apply preprocessing
x_train <- bake(rec, new_data = train_data) %>% select(-Bikes_Counted) %>% as.matrix()
x_test <- bake(rec, new_data = test_data) %>% as.matrix()

y_train <- train_data$Bikes_Counted
y_test <- test$Bikes_Counted

# Build Keras model
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.2) %>% # The key to reduce overfitting
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1)

model %>% compile(
  loss = "mse",
  optimizer = optimizer_adam(),
  metrics = list("mae")
)
###############################
##### Other models tried: #####
###############################

  ## Original model: No drop out
      # model <- keras_model_sequential() %>%
      #  layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train)) %>%
      #  layer_dense(units = 16, activation = "relu") %>%
      #  layer_dense(units = 1)

    # Bad metrics when comparing cross validation to training error

  ## Model with wider layers
      # model <- keras_model_sequential() %>%
      #   layer_dense(units = 64, activation = "relu", input_shape = ncol(x_train)) %>%
      #   layer_dense(units = 32, activation = "relu") %>%
      #  layer_dense(units = 1)

    # Needlessly complex, had similar results in CV error compared to the less wide model
    # the dropout layer helped the most


# Train model
history <- model %>% fit(
  x_train, y_train,
  epochs = 50, # After which the model starts to overfit, with no further gain in validation loss
  batch_size = 32,
  validation_split = 0.2,
  verbose = 0
)

plot(history) # Avoids per-epoch output, but keeps loss plot

# Predict + evaluate
pred <- model %>% predict(x_test) %>% as.vector()
## 35/35 - 0s - 45ms/epoch - 1ms/step
mae <- mean(abs(pred - y_test))
rmse <- sqrt(mean((pred - y_test)^2))
r2 <- 1 - sum((y_test - pred)^2) / sum((y_test - mean(y_test))^2)

metrics_nn <- c(MAE = mae, RMSE = rmse, R2 = r2)

Model insight: After trying multiple layer configurations, the models always learned very quickly within about 30 epochs, but suffered from major overfitting. Adding cross-validation folds did not have as much of an effect as in the other models, so a dropout layer was tested, which helped a lot. Interestingly, it is only useful between the two wide layers and not effective as the second-to-last layer. This is likely because overfitting happens mostly in the earlier, wider layers; the weights between the last layers are fewer and harder to influence through dropout.
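
The dropout mechanism itself is simple to illustrate in base R (a toy sketch, not keras internals): during training, each activation is zeroed with probability p, and the survivors are rescaled by 1/(1 - p) ("inverted dropout") so the expected activation stays unchanged.

```r
# Toy illustration of inverted dropout; function and names are ours, not keras'
set.seed(42)
dropout <- function(a, p = 0.2) {
  keep <- runif(length(a)) > p   # keep each unit with probability 1 - p
  (a * keep) / (1 - p)           # rescale so the expected output equals the input
}
a <- c(1, 2, 3, 4, 5)
d <- dropout(a, p = 0.5)
d  # with this seed: 2 4 0 8 10 -- one unit dropped, the rest rescaled by 2
```

At prediction time dropout is switched off, which is why keras applies it only during fit().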

Model Comparison

  • Compare the performance of the 5 models.
# Combine all metric vectors into a list
metrics_list <- list(
  tree = metrics_tree,
  ens1 = metrics_ens1,
  ens2 = metrics_ens2,
  nn = metrics_nn
)

# Convert to data frame
metrics_df <- do.call(rbind, metrics_list)

# Add model names as a column
metrics_df <- data.frame(Model = rownames(metrics_df), metrics_df, row.names = NULL)

# Ensure column names are the same
colnames(metrics_eriks_models) <- colnames(metrics_df)

# Bind the data frames
combined_metrics <- dplyr::bind_rows(metrics_eriks_models, metrics_df)


# Optional: pretty table
library(knitr)
kable(combined_metrics, digits = 3, caption = "Model Performance Metrics")
Model Performance Metrics

Model            MAE    RMSE     R2
LM Baseline   43.341  73.001  0.579
LM + IA       30.643  55.624  0.755
LM + IA + FF  30.021  54.221  0.768
tree          29.080  60.502  0.702
ens1          45.999  84.918  0.413
ens2          26.719  51.144  0.787
nn            26.205  52.605  0.775

PCA

  • Run a PCA on your input variables and discuss the scope for dimensionality reduction in your dataset.
  • Rerun the previous 5 models on all PCs or on a reduced number of PCs.
# Drop non-numeric / ID columns from training data
df_base <- df[, !(names(df) %in% c("Date_Hour"))]

# Define categorical variables to one-hot encode
categorical_vars <- c("precip_type", "cloud_cover", "Hour", "Weekday", "Holiday", 
                      "School_holidays", "wind_dir", "Month")

# One-hot encode categorical variables using model.matrix (no intercept)
dummies <- model.matrix(~ . -1, data = df_base[, categorical_vars])

# Extract numeric variables
numeric_vars <- df_base[, !(names(df_base) %in% categorical_vars)]

# Combine numeric variables and dummy variables
df_encoded <- cbind(numeric_vars, dummies)

# Remove rows with NA in training set
clean_df <- na.omit(df_encoded)

# Run PCA on scaled data
pca <- prcomp(clean_df, scale. = TRUE)

# Summary and scree plot
summary(pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0718 1.76451 1.40242 1.29641 1.24209 1.21056 1.18307
## Proportion of Variance 0.0588 0.04265 0.02694 0.02302 0.02113 0.02007 0.01917
## Cumulative Proportion  0.0588 0.10145 0.12839 0.15142 0.17255 0.19262 0.21180
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.17237 1.13945 1.13051 1.12211 1.11865 1.11101 1.10927
## Proportion of Variance 0.01883 0.01779 0.01751 0.01725 0.01714 0.01691 0.01686
## Cumulative Proportion  0.23063 0.24841 0.26592 0.28317 0.30031 0.31722 0.33407
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     1.09851 1.09644 1.08699 1.08279 1.07558 1.07357 1.06638
## Proportion of Variance 0.01653 0.01647 0.01619 0.01606 0.01585 0.01579 0.01558
## Cumulative Proportion  0.35060 0.36707 0.38326 0.39932 0.41517 0.43096 0.44653
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     1.06220 1.05829 1.05591 1.05314 1.04543 1.04346 1.04077
## Proportion of Variance 0.01546 0.01534 0.01527 0.01519 0.01497 0.01492 0.01484
## Cumulative Proportion  0.46199 0.47733 0.49260 0.50780 0.52277 0.53768 0.55252
##                          PC29   PC30    PC31    PC32   PC33    PC34    PC35
## Standard deviation     1.0395 1.0289 1.02843 1.02735 1.0253 1.02348 1.02147
## Proportion of Variance 0.0148 0.0145 0.01449 0.01446 0.0144 0.01435 0.01429
## Cumulative Proportion  0.5673 0.5818 0.59632 0.61077 0.6252 0.63952 0.65382
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     1.01941 1.01679 1.01476 1.01155 1.00691 1.00612 1.00252
## Proportion of Variance 0.01424 0.01416 0.01411 0.01402 0.01389 0.01387 0.01377
## Cumulative Proportion  0.66805 0.68221 0.69632 0.71034 0.72423 0.73809 0.75186
##                           PC43    PC44    PC45    PC46    PC47    PC48    PC49
## Standard deviation     0.99963 0.99762 0.99556 0.98849 0.98432 0.98018 0.97532
## Proportion of Variance 0.01369 0.01363 0.01358 0.01339 0.01327 0.01316 0.01303
## Cumulative Proportion  0.76555 0.77918 0.79276 0.80614 0.81942 0.83258 0.84561
##                           PC50    PC51   PC52    PC53    PC54    PC55    PC56
## Standard deviation     0.97333 0.96405 0.9629 0.96001 0.93559 0.91099 0.89656
## Proportion of Variance 0.01298 0.01273 0.0127 0.01262 0.01199 0.01137 0.01101
## Cumulative Proportion  0.85859 0.87132 0.8840 0.89664 0.90863 0.92000 0.93101
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.88218 0.84728 0.83169 0.77519 0.69297 0.63749 0.59342
## Proportion of Variance 0.01066 0.00983 0.00948 0.00823 0.00658 0.00557 0.00482
## Cumulative Proportion  0.94168 0.95151 0.96098 0.96922 0.97579 0.98136 0.98619
##                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
## Standard deviation     0.55112 0.46071 0.39800 0.36394 0.25032 0.23827 0.20367
## Proportion of Variance 0.00416 0.00291 0.00217 0.00181 0.00086 0.00078 0.00057
## Cumulative Proportion  0.99035 0.99325 0.99542 0.99724 0.99810 0.99887 0.99944
##                           PC71      PC72      PC73
## Standard deviation     0.20172 2.584e-15 2.325e-15
## Proportion of Variance 0.00056 0.000e+00 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00 1.000e+00
plot(pca, type = "l")

# Choose first 3 PCs
train_pca <- as.data.frame(pca$x[, 1:3])

# Add response variable (assuming Bikes_Counted is in clean_df or df)
train_pca$Bikes_Counted <- clean_df$Bikes_Counted

# Prepare test set:
# Select same variables (numeric + dummies) from test, except Date_Hour
test_base <- test[, !(names(test) %in% c("Date_Hour"))]

# One-hot encode categorical vars in test with model.matrix
test_dummies <- model.matrix(~ . -1, data = test_base[, categorical_vars])

# Numeric vars in test
test_numeric <- test_base[, !(names(test_base) %in% categorical_vars)]

# Combine test numeric + dummies
test_encoded <- cbind(test_numeric, test_dummies)

# Make sure test_encoded has the same columns as clean_df (train encoded)
# If test has missing columns, add them with zeros
missing_cols <- setdiff(names(clean_df), names(test_encoded))
for(col in missing_cols) {
  test_encoded[, col] <- 0
}

# Drop extra columns in test not in train
extra_cols <- setdiff(names(test_encoded), names(clean_df))
if(length(extra_cols) > 0) test_encoded <- test_encoded[, !(names(test_encoded) %in% extra_cols)]

# Order columns to match train
test_encoded <- test_encoded[, names(clean_df)]

# Remove rows with NA in test (optional)
test_encoded <- na.omit(test_encoded)

# Scale test set using train PCA parameters
test_scaled <- scale(test_encoded, center = pca$center, scale = pca$scale)

# Apply PCA rotation on test set (first 3 PCs)
test_pca <- as.data.frame(test_scaled %*% pca$rotation[, 1:3])

# Add response variable to test PCA dataframe
test_pca$Bikes_Counted <- test$Bikes_Counted
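
Centering and scaling the test set with the training parameters and then multiplying by the rotation, as above, is exactly what predict.prcomp() does internally; a quick self-contained check on toy data (illustrative variable names):

```r
# Verify that the manual PCA projection matches predict() on new data
set.seed(2)
Xtr <- matrix(rnorm(50 * 4), ncol = 4, dimnames = list(NULL, paste0("v", 1:4)))
Xte <- matrix(rnorm(10 * 4), ncol = 4, dimnames = list(NULL, paste0("v", 1:4)))
p <- prcomp(Xtr, scale. = TRUE)
manual  <- scale(Xte, center = p$center, scale = p$scale) %*% p$rotation[, 1:3]
builtin <- predict(p, newdata = Xte)[, 1:3]
max(abs(manual - builtin))  # numerically zero
```

Using the training center/scale on the test set (rather than re-scaling the test set by its own statistics) is essential; otherwise information would leak and the projections would be inconsistent.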

At first only numerical variables were used in the PCA; after scaling, 3 PCs were chosen. However, after working with the data and realising that the categorical variables are very influential in modelling the bikes counted, more research was done to get those variables involved in the PCA.

All categorical variables were one-hot encoded as dummies for each category, so that each can now contribute its own dimension to the PCA if it is influential enough. The scree plot after adding all one-hot-encoded categorical variables still indicated that 3 PCs capture variance most efficiently, but the whole curve was smoother, indicating that the categorical variables did have an effect. Looking at individual dummies' factor loadings would also confirm this, but there are many of them.

3 PCs will be used; after that, the marginal explanatory power of the PCs drops drastically (see plot).
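
The drop in marginal explained variance can also be read off numerically from the cumulative proportions rather than eyeballed from the scree plot. A self-contained toy sketch (the 90% target is an arbitrary illustration — in the summary above such a target would only be reached around PC54, which is why the scree-plot elbow guides the choice of 3 here):

```r
# Pick the smallest k whose cumulative explained variance reaches a target
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)
X[, 4] <- X[, 1] + rnorm(200, sd = 0.01)  # near-duplicate column => redundancy
p <- prcomp(X, scale. = TRUE)
cum_var <- cumsum(p$sdev^2) / sum(p$sdev^2)
k <- which(cum_var >= 0.90)[1]            # smallest k reaching the 90% target
```

The same computation on our pca object, cumsum(pca$sdev^2) / sum(pca$sdev^2), reproduces the Cumulative Proportion row of the summary.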

The rest of the models are copies of the initial ones, applied to the 3 PCs. Interestingly, these take less time to run, due to the large decrease in data volume to be processed.

Linear Regression (PCA)

Repeat the same procedure as for other lms: 5 iterations of hold-out cross-validation with 80% Training - 20% Testing split:

### Conduct CV analogously to the other linear models
set.seed(0345)
iter_cv <- 5

pca_cv_res_matrix <- matrix(NA, nrow = iter_cv, ncol = (1 * 2 * 3)) # 1 model x 2 splits x 3 metrics

models <- c("LM (PCA)")
splits <- c("Train", "Test")
metrics <- c("MAE","RMSE", "R2")

colnames(pca_cv_res_matrix) <- paste(rep(models, each = 6), 
                                 rep(rep(splits, each = 3)),
                                 rep(metrics, 2),
                                 sep = ", ")

# For hold-out CV: Need entire PC dataset!
whole_pca_dataset <- rbind(train_pca, test_pca)
whole_pca_dataset <- whole_pca_dataset[order(rownames(whole_pca_dataset) |> as.numeric()),]

# Store cross-validation RMSE
for(j in 1:iter_cv) {
  
  # Hold-out CV: always get 80% for training and 20% for testing
  train_sample <- sample(1:nrow(whole_pca_dataset), floor(0.8 * nrow(whole_pca_dataset)))
  
  data_train <- whole_pca_dataset[train_sample,]
  data_test <- whole_pca_dataset[- train_sample,]
  
  # Train lm and find performance on training set
  lm0 <- lm(formula = Bikes_Counted ~ . , data = data_train)
  
  train_pred_0 <- predict(lm0, data_train) |> no_negative_predictions()
  
  # Predict on testing set as well
  test_pred_0 <- predict(lm0, data_test) |> no_negative_predictions()
  

  pca_cv_res_matrix[j, "LM (PCA), Train, MAE"] <- calc_mae(data_train$Bikes_Counted, train_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Train, RMSE"] <- calc_rmse(data_train$Bikes_Counted, train_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Train, R2"] <- calc_rsq(data_train$Bikes_Counted, train_pred_0)
    
  pca_cv_res_matrix[j, "LM (PCA), Test, MAE"] <- calc_mae(data_test$Bikes_Counted, test_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Test, RMSE"] <- calc_rmse(data_test$Bikes_Counted, test_pred_0)
  pca_cv_res_matrix[j, "LM (PCA), Test, R2"] <- calc_rsq(data_test$Bikes_Counted, test_pred_0)
  
  
}

print(pca_cv_res_matrix |> as.data.frame())
##   LM (PCA), Train, MAE LM (PCA), Train, RMSE LM (PCA), Train, R2
## 1             42.81623              75.47941           0.5432904
## 2             42.81649              74.64720           0.5535519
## 3             42.51508              74.09849           0.5488918
## 4             43.89116              76.33699           0.5577187
## 5             42.39546              74.31666           0.5522816
##   LM (PCA), Test, MAE LM (PCA), Test, RMSE LM (PCA), Test, R2
## 1            42.21231             68.20898          0.5840200
## 2            42.16285             71.11037          0.5470057
## 3            41.83485             73.80645          0.5607646
## 4            39.38429             62.28175          0.5344216
## 5            43.57027             72.34487          0.5544129
# cv_res_matrix |> as.data.frame() |> View()

pca_mean_of_all_iters <- apply(pca_cv_res_matrix, 2, mean) 

pca_cv_res_inspection <- pca_cv_res_matrix |> apply(2, mean) |> matrix(nrow = 3) |> t()

colnames(pca_cv_res_inspection) <- c("MAE", "RMSE", "R2")
rownames(pca_cv_res_inspection) <- paste(rep(c("LM (3 PCAs)"), each = 2),
                                         c("Train", "Test"), sep = ": ")


# Bring it beautifully together
cv_res_inspection_final <- rbind(cv_res_inspection, pca_cv_res_inspection) |> data.frame()

Add number of explanatory variables of every model:

cv_res_inspection_final$n_explanatories <- rep(
  c(
    lm_exp |> model.matrix() |> ncol(),
    lm_exp_plus_interaction |> model.matrix() |> ncol(),
    lm_exp_plus_interaction_functforms |> model.matrix() |> ncol(),
    lm0 |> model.matrix() |> ncol() # lm0 still the PC lm from CV
  ),
  each = 2
)

print(cv_res_inspection_final)
##                          MAE     RMSE        R2 n_explanatories
## LM: Train           38.03583 71.85036 0.5779033              71
## LM: Test            37.46385 69.43331 0.5852821              71
## LM + IA: Train      27.74394 54.90258 0.7535397             686
## LM + IA: Test       27.56767 53.26840 0.7558257             686
## LM + IA + FF: Train 27.03963 53.38733 0.7669553             691
## LM + IA + FF: Test  26.90295 51.87950 0.7683806             691
## LM (3 PCAs): Train  42.88688 74.97575 0.5511469               4
## LM (3 PCAs): Test   41.83292 69.55049 0.5561250               4

It can be seen that the model using only the three PCs (+ intercept) as explanatory variables performs roughly at the same level as the baseline lm, which uses about 17 variables – n_explanatories is higher because, for factor variables, every possible realization (except one reference level) receives an individual estimate. This is a good point in favor of the PCs. However, the lms including interaction terms and other functional forms still outperform the PCA model.
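
The gap between the ~17 raw variables and n_explanatories = 71 comes from model.matrix() expanding each factor into one dummy column per level except the reference level, plus the intercept. A minimal illustration with a made-up weekday factor:

```r
# One factor with 3 levels expands to intercept + 2 dummy columns
d <- data.frame(
  y  = 1:6,
  wd = factor(c("Mon", "Tue", "Wed", "Mon", "Tue", "Wed"))
)
mm <- model.matrix(y ~ wd, data = d)
colnames(mm)  # "(Intercept)" "wdTue" "wdWed" -- "Mon" is the reference level
```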

As we see, and would have expected, overfitting is not a big issue for an OLS model with > 50,000 observations, even when including interaction terms and functional forms pushes the design matrix to almost 700 explanatory columns.

Tree Model (PCA)

library(rpart)

# Train initial tree model on PCA-transformed training data
model_pca <- rpart(Bikes_Counted ~ ., data = train_pca, method = "anova", cp = 0, maxdepth = 5) 
# maxdepth = 5: good balance for the training/CV loss gap without both losses increasing

# Use cross-validation info to find optimal cp
best_cp_pca <- model_pca$cptable[which.min(model_pca$cptable[, "xerror"]), "CP"]

# Prune the tree
pruned_model_pca <- prune(model_pca, cp = best_cp_pca)

# Predict on PCA-transformed test data
pred_pca <- predict(pruned_model_pca, newdata = test_pca)

# True values
actual_pca <- test_pca$Bikes_Counted

# Metrics
mae_pca <- mean(abs(pred_pca - actual_pca))
rmse_pca <- sqrt(mean((pred_pca - actual_pca)^2))
r2_pca <- 1 - sum((actual_pca - pred_pca)^2) / sum((actual_pca - mean(actual_pca))^2)
metrics_tree_pca <- c(MAE = mae_pca, RMSE = rmse_pca, R2 = r2_pca)

# Training MSE
train_pred_pca <- predict(pruned_model_pca, newdata = train_pca)
train_actual_pca <- train_pca$Bikes_Counted
training_mse_pca <- mean((train_actual_pca - train_pred_pca)^2)

# CV MSE (xerror * variance of response)
cv_mse_pca <- min(model_pca$cptable[, "xerror"]) * var(train_actual_pca)

# Output MSE losses
losses_tree_pca <- c(Training_MSE = training_mse_pca, CV_MSE = cv_mse_pca)
losses_tree_pca
## Training_MSE       CV_MSE 
##     3148.354     4042.605
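
A note on the CV_MSE computation above: rpart's xerror column is scaled relative to the root-node error, so multiplying by the root-node MSE converts it back to an absolute CV MSE (var() differs from the root MSE only by the factor n/(n − 1)). A self-contained check on toy data, using only the bundled rpart package:

```r
library(rpart)

# Toy data with a clear step signal so the tree has something to find
set.seed(3)
n <- 400
x <- runif(n)
y <- ifelse(x > 0.5, 2, 0) + rnorm(n, sd = 0.3)

fit <- rpart(y ~ x, method = "anova")
root_mse  <- mean((y - mean(y))^2)              # the reference xerror is scaled by
cv_mse    <- min(fit$cptable[, "xerror"]) * root_mse
train_mse <- mean((y - predict(fit))^2)
```

For a tree that finds real structure, both cv_mse and train_mse come out well below root_mse, mirroring the losses_tree_pca result above.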

Ensemble 1: Random Forest (PCA)

library(randomForest)
library(caret)

# Remove Date_Hour from PCA data if present
train_pca <- train_pca[, !(names(train_pca) %in% c("Date_Hour"))]

# Prepare predictors and target
x_pca <- train_pca[, -which(names(train_pca) == "Bikes_Counted")]
y_pca <- train_pca$Bikes_Counted

set.seed(42)
# Tune mtry with OOB error on PCA features
best_mtry_pca <- tuneRF(
  x          = x_pca,
  y          = y_pca,
  mtryStart  = 3,
  stepFactor = 1.5,
  improve    = 0.05,
  ntreeTry   = 100,
  trace      = TRUE,
  plot       = TRUE
)
## mtry = 3  OOB error = 3550.145 
## Searching left ...
## mtry = 2     OOB error = 3530.251 
## 0.005603713 0.05 
## Searching right ...

optimal_mtry <- best_mtry_pca[which.min(best_mtry_pca[, "OOBError"]), "mtry"]

# Setup 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")

set.seed(42)
# Train random forest with cross-validation
rf_pca_cv <- caret::train(
  Bikes_Counted ~ .,
  data      = train_pca,
  method    = "rf",
  trControl = ctrl,
  tuneGrid  = data.frame(mtry = optimal_mtry),
  ntree     = 500,
  nodesize  = 40,
  maxnodes  = 15,
  importance = TRUE
)

# Extract CV metrics
cv_rmse <- rf_pca_cv$results$RMSE
cv_mse  <- cv_rmse^2

# Final model fitted on full training data
rf_pca <- rf_pca_cv$finalModel

# Training set predictions and MSE
train_preds <- predict(rf_pca, newdata = train_pca)
training_mse_pca <- mean((train_pca$Bikes_Counted - train_preds)^2)

# Prepare test data (remove Date_Hour and Bikes_Counted)
test_pca_data <- test_pca[, !(names(test_pca) %in% c("Date_Hour", "Bikes_Counted"))]

# Test set predictions and metrics
preds <- predict(rf_pca, newdata = test_pca_data)
actuals <- test_pca$Bikes_Counted

mae  <- mean(abs(preds - actuals))
rmse <- sqrt(mean((preds - actuals)^2))
r2   <- 1 - sum((actuals - preds)^2) / sum((actuals - mean(actuals))^2)

# Output metrics
metrics_ens1_pca <- c(MAE = mae, RMSE = rmse, R2 = r2)
losses_rf_pca    <- c(Training_MSE = training_mse_pca, CV_MSE = cv_mse)

losses_rf_pca
## Training_MSE       CV_MSE 
##     3320.574     3614.888

Ensemble 2: Gradient Boosting (PCA)

library(gbm)

# Use PCA-transformed training and test data
train_data_pca <- train_pca
test_data_pca <- test_pca[, -which(names(test_pca) == "Bikes_Counted")]

set.seed(123)
model_gbm_pca <- gbm(
  Bikes_Counted ~ .,
  data = train_data_pca,
  distribution = "gaussian",
  n.trees = 200,
  interaction.depth = 2,
  shrinkage = 0.05,
  cv.folds = 15
)

# Best number of trees based on CV
best_iter_pca <- gbm.perf(model_gbm_pca, method = "cv", plot.it = FALSE)

# Predict on test set
pred_pca <- predict(model_gbm_pca, newdata = test_data_pca, n.trees = best_iter_pca)
actual_pca <- test_pca$Bikes_Counted

# Evaluation metrics
mae_pca <- mean(abs(pred_pca - actual_pca))
rmse_pca <- sqrt(mean((pred_pca - actual_pca)^2))
r2_pca <- 1 - sum((actual_pca - pred_pca)^2) / sum((actual_pca - mean(actual_pca))^2)

# Store test metrics
metrics_ens2_pca <- c(MAE = mae_pca, RMSE = rmse_pca, R2 = r2_pca)

# Training and CV loss over iterations
train_loss_pca <- model_gbm_pca$train.error
val_loss_pca <- model_gbm_pca$cv.error

# Plot training and validation MSE
plot(train_loss_pca, type = "l", col = "grey", lwd = 2, ylab = "MSE", xlab = "Number of Trees")
lines(val_loss_pca, col = "orange", lwd = 2)
legend("topright", legend = c("Training MSE", "CV MSE"), col = c("grey", "orange"), lwd = 2)

# Store selected losses
losses_gbm_pca <- c(Training_MSE = train_loss_pca[best_iter_pca], CV_MSE = val_loss_pca[best_iter_pca])

Neural Network (PCA)

library(keras)

# Prepare PCA-based training and test matrices
x_train_pca <- train_pca[, -which(names(train_pca) == "Bikes_Counted")] %>% as.matrix()
x_test_pca <- test_pca[, -which(names(test_pca) == "Bikes_Counted")] %>% as.matrix()

y_train_pca <- train_pca$Bikes_Counted
y_test_pca <- test_pca$Bikes_Counted

# Build Keras model: Same as the previous NN
model_pca <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train_pca)) %>%
  layer_dropout(rate = 0.2) %>% # The key to reduce overfitting
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1)

model_pca %>% compile(
  loss = "mse",
  optimizer = optimizer_adam(),
  metrics = list("mae")
)

# Train model
history_pca <- model_pca %>% fit(
  x_train_pca, y_train_pca,
  epochs = 200,
  batch_size = 16,
  validation_split = 0.2,
  verbose = 0
)

plot(history_pca)

# Predict and evaluate
pred_pca <- model_pca %>% predict(x_test_pca) %>% as.vector()
## 35/35 - 0s - 32ms/epoch - 914us/step
mae_pca <- mean(abs(pred_pca - y_test_pca))
rmse_pca <- sqrt(mean((pred_pca - y_test_pca)^2))
r2_pca <- 1 - sum((y_test_pca - pred_pca)^2) / sum((y_test_pca - mean(y_test_pca))^2)

metrics_nn_pca <- c(MAE = mae_pca, RMSE = rmse_pca, R2 = r2_pca)

Model Selection

  • Compare all of your models based on their CV performance. You will have 10 options to consider: five models with and without PCA.
  • Present the results in a table or chart.
  • Estimate the expected loss of your best model on the test set.
# Combine all metric vectors into a list
metrics_list <- list(
  tree = metrics_tree,
  ens1 = metrics_ens1,
  ens2 = metrics_ens2,
  nn = metrics_nn, 
  tree_pca = metrics_tree_pca,
  ens1_pca = metrics_ens1_pca,
  ens2_pca = metrics_ens2_pca,
  nn_pca = metrics_nn_pca  
)

# Convert to data frame
metrics_df <- do.call(rbind, metrics_list)

# Add model names as a column
metrics_df <- data.frame(Model = rownames(metrics_df), metrics_df, row.names = NULL)

# Ensure column names are the same
colnames(metrics_eriks_models) <- colnames(metrics_df)

# Bind the data frames
combined_metrics <- dplyr::bind_rows(metrics_eriks_models, metrics_df)
  

# Adding lm_pca to metrics table
last_row <- cv_res_inspection_final[nrow(cv_res_inspection_final), ]

lm_pca_metrics <- data.frame(
  Model = "LM (3 PCAs)",
  MAE = last_row$MAE,
  RMSE = last_row$RMSE,
  R2 = last_row$R2
)

# Insert at position 8
combined_metrics <- rbind(
  combined_metrics[1:7, ],
  lm_pca_metrics,
  combined_metrics[8:nrow(combined_metrics), ]
)

# Optional: pretty table
library(knitr)
kable(combined_metrics, digits = 3, caption = "Model Performance Metrics")
Model Performance Metrics

Model            MAE    RMSE     R2
LM Baseline   43.341  73.001  0.579
LM + IA       30.643  55.624  0.755
LM + IA + FF  30.021  54.221  0.768
tree          29.080  60.502  0.702
ens1          45.999  84.918  0.413
ens2          26.719  51.144  0.787
nn            26.205  52.605  0.775
LM (3 PCAs)   41.833  69.550  0.556
tree_pca      33.070  52.640  0.774
ens1_pca      33.843  56.067  0.744
ens2_pca      32.533  52.299  0.777
nn_pca        33.849  54.529  0.758

Summary

Bringing it together, we can see that more complex models do not necessarily outperform the linear models when we train them in a way that avoids overfitting. The linear model with interaction terms already explains roughly \(\frac{3}{4}\) of the variation in the data. The neural network, despite its higher complexity and being among the best performers of the other models, is only moderately better.

The models with PCA perform slightly worse than their non-PCA counterparts. This is expected, since some of the variance in the full data is lost. These models do, however, perform very well considering the large decrease in the number of variables, which shows that PCA is a powerful tool for reducing dimensionality. The PCA models were noticeably faster to run, and for even larger datasets PCA saves a lot of compute power compared to using the raw data.

If we were to develop a model for a use case exceeding the scope of the coursework, then including time lags or aggregated data from the day before could be interesting explanatory variables as well, as has been experimentally outlined in the lm-section. This could be extended for the other models.

All in all, the most relevant determinants for bike traffic seem to be ex-ante known criteria, such as the weekday, month (seasonality), whether it’s a public holiday or not, and hours across a day, which is consistent with what one would expect (especially thanks to regular bike commuters). However, weather data to some extent helps to get more precise predictions.